[fix](storage) avoid duplicate exported compaction metrics #61420

Open
CSoulode wants to merge 3 commits into apache:master from CSoulode:fix/issue-60791-compaction-metrics-duplicate-export

@CSoulode

issue: #60791

Add missing labels for compaction task state and compaction io metrics to avoid duplicate Prometheus time series.

What problem does this PR solve?

Issue Number: close #60791

Related PR: #N/A

Problem Summary:

This PR fixes duplicate Prometheus time series exported by BE compaction metrics.

Although issue #60791 was reported in compute-storage decoupled mode, the root cause is not specific to cloud mode itself. The duplicate series come from shared BE metric definitions/export semantics, where multiple internal metrics were exported with the same metric name and the same label set.
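The collision described above can be illustrated with a minimal sketch. It assumes that a Prometheus series is identified by its metric name plus its full label set, and it models the "before" and "after" states with the label sets shown in the manual test below. The helper names `series_key` and `duplicated` are hypothetical and are not part of Doris; how the BE actually registers these metrics is not shown here.

```python
from collections import Counter


def series_key(name, labels):
    """Identity of a series: metric name plus its full, order-independent label set."""
    return (name, tuple(sorted(labels.items())))


def duplicated(series):
    """Return the series keys that are exported more than once."""
    counts = Counter(series_key(name, labels) for name, labels in series)
    return [key for key, n in counts.items() if n > 1]


NAME = "doris_be_compaction_task_state_total"

# Before (assumed model): two internal metrics per compaction type,
# both exported with only the `type` label, so their keys collide.
before = [
    (NAME, {"type": "base"}),
    (NAME, {"type": "base"}),
    (NAME, {"type": "cumulative"}),
    (NAME, {"type": "cumulative"}),
]

# After: an extra `state` label makes every exported series unique.
after = [
    (NAME, {"type": t, "state": s})
    for t in ("base", "cumulative")
    for s in ("pending", "running")
]

print(len(duplicated(before)))  # 2
print(len(duplicated(after)))   # 0
```

The point of the sketch is only that a distinguishing label is enough to separate the series; the metric name itself does not need to change.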

Release note

None

Check List (For Author)

  • Test
    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason

Manual test:

  1. Start a local Doris cluster and prepare a tablet with compaction candidates.
  2. Trigger cumulative compaction:
curl -X POST "http://$BE_HOST:$BE_HTTP/api/compaction/run?tablet_id=$TABLET_ID&compact_type=cumulative"
  3. Check exported compaction task metrics:
curl -s "http://$BE_HOST:$BE_HTTP/metrics" \
  | grep '^doris_be_compaction_task_state_total' \
  | sed -E 's/ [^ ]+$//' \
  | sort \
  | uniq -c

Before:

2 doris_be_compaction_task_state_total{type="base"}
2 doris_be_compaction_task_state_total{type="cumulative"}

After:

1 doris_be_compaction_task_state_total{state="pending",type="base"}
1 doris_be_compaction_task_state_total{state="pending",type="cumulative"}
1 doris_be_compaction_task_state_total{state="running",type="base"}
1 doris_be_compaction_task_state_total{state="running",type="cumulative"}
  • Behavior changed:

    • No.
    • Yes. The exported metric labels in /metrics are changed to avoid duplicate Prometheus series.
  • Does this need documentation?

    • No.
    • Yes.
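The manual check above can also be run offline against a saved copy of the `/metrics` output. The following is a rough Python equivalent of the `grep | sed | sort | uniq -c` pipeline, not part of this PR. The function names are hypothetical, and the sample values (`0`, `1`) are made up for illustration; only the series names and labels come from the output shown above.

```python
from collections import Counter

PREFIX = "doris_be_compaction_task_state_total"


def count_series(metrics_text, prefix):
    """Count how often each series (name plus labels) with the given prefix appears."""
    counts = Counter()
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, comment lines, and unrelated metrics (the grep step).
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        # Drop the trailing sample value (the sed step).
        series, _, _value = line.rpartition(" ")
        counts[series] += 1
    return counts


def duplicate_series(metrics_text, prefix):
    """Return only the series that appear more than once."""
    return {s: n for s, n in count_series(metrics_text, prefix).items() if n > 1}


# Illustrative samples; the numeric values are placeholders.
before_sample = """\
doris_be_compaction_task_state_total{type="base"} 0
doris_be_compaction_task_state_total{type="base"} 1
doris_be_compaction_task_state_total{type="cumulative"} 0
doris_be_compaction_task_state_total{type="cumulative"} 1
"""

after_sample = """\
doris_be_compaction_task_state_total{state="pending",type="base"} 0
doris_be_compaction_task_state_total{state="pending",type="cumulative"} 0
doris_be_compaction_task_state_total{state="running",type="base"} 1
doris_be_compaction_task_state_total{state="running",type="cumulative"} 1
"""

for series, n in sorted(duplicate_series(before_sample, PREFIX).items()):
    print(n, series)
```

An empty result from `duplicate_series` corresponds to every `uniq -c` count being 1 in the shell version.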

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@CSoulode
Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26810 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 36233087b6e50345100ff50f22badfb463bd0ac5, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17599	4638	4329	4329
q2	q3	10645	813	524	524
q4	4679	352	250	250
q5	7559	1222	1015	1015
q6	176	177	146	146
q7	783	847	677	677
q8	9302	1428	1307	1307
q9	4786	4706	4656	4656
q10	6251	1890	1662	1662
q11	464	263	258	258
q12	701	590	463	463
q13	18054	2939	2164	2164
q14	233	232	214	214
q15	q16	725	741	669	669
q17	730	840	433	433
q18	5993	5380	5316	5316
q19	1104	997	603	603
q20	529	494	381	381
q21	4375	1842	1409	1409
q22	334	470	334	334
Total cold run time: 95022 ms
Total hot run time: 26810 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4868	4662	4543	4543
q2	q3	3860	4341	3843	3843
q4	876	1199	781	781
q5	4098	4383	4345	4345
q6	186	175	138	138
q7	1761	1649	1551	1551
q8	2475	2707	2628	2628
q9	7613	7395	7396	7395
q10	3812	4063	3667	3667
q11	509	423	421	421
q12	495	595	456	456
q13	2736	3283	2343	2343
q14	277	307	273	273
q15	q16	737	758	706	706
q17	1149	1292	1324	1292
q18	7291	6803	6750	6750
q19	983	928	878	878
q20	2127	2165	1979	1979
q21	4100	3499	3321	3321
q22	481	426	400	400
Total cold run time: 50434 ms
Total hot run time: 47710 ms

@doris-robot

TPC-DS: Total hot run time: 167912 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 36233087b6e50345100ff50f22badfb463bd0ac5, data reload: false

query5	4325	640	519	519
query6	331	222	198	198
query7	4223	466	261	261
query8	332	238	223	223
query9	8725	2729	2730	2729
query10	537	373	333	333
query11	6996	5085	4850	4850
query12	180	132	120	120
query13	1271	444	328	328
query14	5795	3692	3457	3457
query14_1	2813	2774	2773	2773
query15	205	194	174	174
query16	955	445	439	439
query17	880	697	600	600
query18	2427	452	349	349
query19	218	208	186	186
query20	135	128	127	127
query21	213	141	115	115
query22	13197	13776	14692	13776
query23	16381	15786	15621	15621
query23_1	15835	15603	15339	15339
query24	7195	1645	1225	1225
query24_1	1229	1234	1232	1232
query25	566	509	438	438
query26	1253	284	154	154
query27	2752	482	302	302
query28	4521	1883	1876	1876
query29	851	605	512	512
query30	301	230	195	195
query31	1011	967	877	877
query32	89	78	72	72
query33	531	357	296	296
query34	894	871	527	527
query35	637	699	627	627
query36	1111	1182	965	965
query37	138	102	84	84
query38	2973	2900	2881	2881
query39	879	830	829	829
query39_1	787	822	794	794
query40	253	159	145	145
query41	69	64	66	64
query42	264	261	264	261
query43	246	242	222	222
query44	
query45	200	189	186	186
query46	886	991	618	618
query47	2130	2154	2078	2078
query48	315	324	249	249
query49	661	476	404	404
query50	703	294	221	221
query51	4060	4086	4047	4047
query52	268	269	260	260
query53	296	335	285	285
query54	315	281	283	281
query55	100	92	84	84
query56	338	348	338	338
query57	1921	1833	1788	1788
query58	302	279	277	277
query59	2783	2971	2742	2742
query60	357	356	350	350
query61	174	178	177	177
query62	639	601	552	552
query63	314	291	281	281
query64	5111	1277	1004	1004
query65	
query66	1474	467	356	356
query67	24203	24243	24170	24170
query68	
query69	418	321	286	286
query70	996	987	940	940
query71	340	310	286	286
query72	2827	2684	2168	2168
query73	550	542	321	321
query74	9653	9545	9412	9412
query75	2872	2734	2437	2437
query76	2363	1029	665	665
query77	360	387	304	304
query78	11027	11163	10505	10505
query79	1092	828	568	568
query80	1354	616	557	557
query81	533	266	230	230
query82	1370	161	120	120
query83	356	262	244	244
query84	331	122	95	95
query85	984	490	469	469
query86	403	307	298	298
query87	3176	3101	2997	2997
query88	3582	2679	2685	2679
query89	422	369	344	344
query90	1807	180	177	177
query91	172	174	138	138
query92	79	73	69	69
query93	907	860	508	508
query94	522	308	291	291
query95	596	350	318	318
query96	646	533	230	230
query97	2470	2519	2386	2386
query98	241	246	224	224
query99	1025	992	919	919
Total cold run time: 249645 ms
Total hot run time: 167912 ms


Development

Successfully merging this pull request may close these issues.

[Bug] doris_be_compaction_task_state_total emitted twice in /metrics endpoint on BE nodes (compute-storage decoupled mode)