<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://ibug.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ibug.io/" rel="alternate" type="text/html" /><updated>2024-10-02T16:02:27+00:00</updated><id>https://ibug.io/feed.xml</id><title type="html">iBug</title><subtitle>The little personal site for iBug</subtitle><author><name>iBug</name></author><entry><title type="html">Make Python 3.12 install user packages without complaints</title><link href="https://ibug.io/blog/2024/09/python3.12-user-packages/" rel="alternate" type="text/html" title="Make Python 3.12 install user packages without complaints" /><published>2024-09-02T00:00:00+00:00</published><updated>2024-09-03T11:59:51+00:00</updated><id>https://ibug.io/blog/2024/09/python3.12-user-packages</id><content type="html" xml:base="https://ibug.io/blog/2024/09/python3.12-user-packages/"><![CDATA[<p>I have a habit of <code class="language-plaintext highlighter-rouge">pip3 install --user</code> and then expecting these packages under <code class="language-plaintext highlighter-rouge">~/.local/lib/</code> to be available for my Python scripts whenever I need them. However, with PEP 668 landing in Python 3.12, I now have to add <code class="language-plaintext highlighter-rouge">--break-system-packages</code> even for <em>user</em> packages. This is super annoying considering that I have multiple projects sharing the same set of common packages (e.g. <a href="https://squidfunk.github.io/mkdocs-material/"><code class="language-plaintext highlighter-rouge">mkdocs-material</code></a>, a nice MkDocs theme). So it’s time to tell <code class="language-plaintext highlighter-rouge">pip</code> to knock it off with that complaint.</p>
<p>Obviously, aliasing <code class="language-plaintext highlighter-rouge">pip3</code> (as per my personal habit, I always prefer <code class="language-plaintext highlighter-rouge">python3</code> and <code class="language-plaintext highlighter-rouge">pip3</code> over <code class="language-plaintext highlighter-rouge">python</code> and <code class="language-plaintext highlighter-rouge">pip</code>) to <code class="language-plaintext highlighter-rouge">pip3 --break-system-packages</code> could work, with all the limitations that any other shell alias bears.</p>
<p>The key here is, by examining how virtual environments work, we can trick Python into thinking that <code class="language-plaintext highlighter-rouge">~/.local</code> is one of them. This is already documented in the <a href="https://docs.python.org/3/library/site.html"><code class="language-plaintext highlighter-rouge">site</code> package</a>:</p>
<blockquote>
<p>If a file named <code class="language-plaintext highlighter-rouge">pyvenv.cfg</code> exists one directory above <code class="language-plaintext highlighter-rouge">sys.executable</code> …</p>
</blockquote>
<p>So here’s the solution, assuming <code class="language-plaintext highlighter-rouge">~/.local/bin</code> is already in your <code class="language-plaintext highlighter-rouge">$PATH</code>:</p>
<ol>
<li>Symlink <code class="language-plaintext highlighter-rouge">/usr/bin/python3</code> to <code class="language-plaintext highlighter-rouge">~/.local/bin/python3</code></li>
<li>Copy <code class="language-plaintext highlighter-rouge">/usr/bin/pip3</code> to <code class="language-plaintext highlighter-rouge">~/.local/bin/pip3</code>, and change the shebang line to <code class="language-plaintext highlighter-rouge">#!/home/example/.local/bin/python3</code> (you’ll have to use the absolute path here, though).</li>
<li>
<p>Create <code class="language-plaintext highlighter-rouge">~/.local/pyvenv.cfg</code> with just one line of content:</p>
<div class="language-ini highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code> <span class="py">include-system-site-packages</span> <span class="p">=</span> <span class="s">true</span>
</code></pre>
</div>
</div>
<p>You can, of course, add other settings for <code class="language-plaintext highlighter-rouge">venv</code>, which is completely optional and up to you.</p>
</li>
</ol>
<p>Now whenever you install something with <code class="language-plaintext highlighter-rouge">pip3</code>, it’ll happily install it under <code class="language-plaintext highlighter-rouge">~/.local/lib/python3.12/site-packages</code> even without the need for <code class="language-plaintext highlighter-rouge">--user</code>.</p>
<p>If you prefer <code class="language-plaintext highlighter-rouge">python</code> or <code class="language-plaintext highlighter-rouge">pip</code> commands, you can just change the file names in the above steps accordingly.</p>
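<p>For reference, here’s a minimal shell sketch of the three steps above. It assumes your home directory is <code class="language-plaintext highlighter-rouge">/home/example</code> and that the system <code class="language-plaintext highlighter-rouge">pip3</code> lives at <code class="language-plaintext highlighter-rouge">/usr/bin/pip3</code>; adjust the paths for your machine:</p>
<div class="language-shell highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code># Assumed paths: adjust /home/example and /usr/bin/pip3 as needed
mkdir -p ~/.local/bin
ln -s /usr/bin/python3 ~/.local/bin/python3
cp /usr/bin/pip3 ~/.local/bin/pip3
# Point the copied pip3 at the symlinked interpreter (absolute path required)
sed -i '1s|.*|#!/home/example/.local/bin/python3|' ~/.local/bin/pip3
# Make Python treat ~/.local as a venv that still sees system packages
echo 'include-system-site-packages = true' > ~/.local/pyvenv.cfg
</code></pre>
</div>
</div>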
<p>Noteworthy points are:</p>
<ol>
<li>This method certainly can work for the system Python installation (at <code class="language-plaintext highlighter-rouge">/usr</code>), which I wouldn’t recommend for obvious reasons. If you insist, you should at least do this under <code class="language-plaintext highlighter-rouge">/usr/local</code> instead.</li>
<li>
<s>This method (when applied to `~/.local`) is ineffective against scripts already shebanged with `#!/usr/bin/python3`. Consider developing the habit of running `python3 script.py` instead of `./script.py` like I do.</s>
</li>
</ol>
<h2 id="corrigenda">Corrigenda</h2>
<ul>
<li>Scripts shebanged with <code class="language-plaintext highlighter-rouge">#!/usr/bin/python3</code> <em>will</em> pick up packages under <code class="language-plaintext highlighter-rouge">~/.local/lib/</code> as this is the default user site directory, which comes in <code class="language-plaintext highlighter-rouge">sys.path</code> even before the system site directory.</li>
</ul>
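<p>A quick, generic way to double-check this ordering on your own machine is to ask the <code class="language-plaintext highlighter-rouge">site</code> module directly:</p>
<div class="language-shell highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code># Prints sys.path plus USER_BASE / USER_SITE; on a Debian-style install,
# ~/.local/lib/python3.12/site-packages should be listed before
# /usr/lib/python3/dist-packages.
python3 -m site
</code></pre>
</div>
</div>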
]]></content><author><name>iBug</name></author><category term="linux" /><category term="python" /><summary type="html"><![CDATA[I have a habit of pip3 install --user and then expecting these packages under ~/.local/lib/ to be available for my Python scripts whenever I need them. However, with PEP 668 landing in Python 3.12, I now have to add --break-system-packages even for user packages. This is super annoying considering that I have multiple projects sharing the same set of common packages (e.g. mkdocs-material, a nice MkDocs theme). So it’s time to tell pip to knock it off with that complaint.]]></summary></entry><entry><title type="html">ZFS Tuning in Practice at a Mirror Site</title><link href="https://ibug.io/blog/2024/08/nju-talk/" rel="alternate" type="text/html" title="ZFS Tuning in Practice at a Mirror Site" /><published>2024-08-17T00:00:00+00:00</published><updated>2024-10-03T00:01:10+00:00</updated><id>https://ibug.io/blog/2024/08/nju-talk</id><content type="html" xml:base="https://ibug.io/blog/2024/08/nju-talk/"><![CDATA[<section id="title">
<h1 class="title">2000 元的机械硬盘 > 3000 元的固态硬盘?</h1>
<h2 style="font-weight: normal;">A.K.A. 镜像站 ZFS 调优实践</h2>
<hr />
<p class="date">iBug @ USTC</p>
<p class="date">2024 年 8 月 17 日<br />
南京大学 开源软件论坛</p>
</section>
<section>
<section id="background">
<h2>USTC Mirrors</h2>
<ul>
<li>Average daily service volume (2024-05 ~ 2024-06):
<ul>
<li>Outbound traffic: ~36 TiB</li>
<li>HTTP: 17M requests, 19 TiB of response traffic</li>
<li>Rsync: 147.8K (21.8K) requests, 10.3 TiB of outbound traffic</li>
</ul>
</li>
<li>Repository capacity at its fullest:
<ul>
<li>HTTP server (XFS): 63.3 TiB / 66.0 TiB (96%, 2023-12-18)</li>
<li>Rsync server (ZFS): 42.4 TiB / 43.2 TiB (98%, 2023-11-21)</li>
</ul>
</li>
</ul>
</section>
<section id="background-2">
<h2>Background</h2>
<ul>
<li>HTTP server:
<ul>
<li>Built in the second half of 2020</li>
<li>10 TB <i class="fas fa-compact-disc fa-spin"></i> × 12</li>
<li>2 TB <i class="fas fa-floppy-disk"></i> × 1</li>
<li>XFS on LVM on HW RAID</li>
<li>Since XFS cannot shrink, free PEs were left in the VG</li>
</ul>
</li>
<li>Rsync server:
<ul>
<li>Built in the second half of 2016</li>
<li>6 TB <i class="fas fa-compact-disc fa-spin"></i> × 12</li>
<li>240 GB <i class="fas fa-floppy-disk"></i> × 2 + 480 GB <i class="fas fa-floppy-disk"></i> × 1 (Optane 900p)</li>
<li>RAID-Z3 (8 data + 3 parity + 1 hot spare)</li>
<li>All default parameters (except <code>zfs_arc_max</code>)</li>
</ul>
</li>
</ul>
<p>Disk I/O routinely > 90%; downloading an ISO from campus gets less than 50 MB/s</p>
</section>
<section id="background-2-image">
<div class="img-container">
<img src="https://image.ibugone.com/grafana/mirrors-io-utilization-may-2024.png" />
<p>Disk load of the two USTC mirror servers during May 2024</p>
</div>
</section>
</section>
<section>
<section id="zfs">
<h2>ZFS</h2>
<ul>
<li>The ultimate solution for single-node storage</li>
<li>RAID, LVM and filesystem all in one</li>
<li>Every piece of data is checksummed</li>
<li><s>Fire and forget</s></li>
<li>So many tunables, though</li>
</ul>
</section>
<section id="zfs-learning">
<h3>Early learning and experimentation</h3>
<ul>
<li><s>Scrounged some disks from (another) professor and</s> installed ZFS, purely for r e s e a r c h</li>
<li>Where does the I/O load come from? <s>Private tracker seeding</s></li>
<li>Results after two and a half years of practice: <i class="fas fa-arrow-up"></i> 1.20 PiB, <i class="fas fa-arrow-down"></i> 1.83 TiB</li>
</ul>
<hr />
<p>Key learning material:</p>
<ul>
<li><a href="https://utcc.utoronto.ca/~cks/space/blog/">Chris Siebenmann's blog</a></li>
<li><a href="https://openzfs.github.io/openzfs-docs/">OpenZFS Documentation</a></li>
<li>iBug's blog: <a href="/p/62">Understanding ZFS block sizes</a> (<a href="/p/62">ibug.io/p/62</a>)
<ul>
<li>plus the many references at the bottom of that post</li>
</ul>
</li>
</ul>
</section>
<section id="zfs-image">
<div class="img-container">
<img src="https://image.ibugone.com/grafana/qb/2024-06-05.png" />
<p>Looks like something odd has been added to Grafana</p>
</div>
</section>
</section>
<section>
<section id="mirrors">
<h2>Mirror site</h2>
<ul>
<li>Serves file downloads</li>
<li><s>Also provides an "upload/download ratio balancing" service for home broadband</s></li>
<li>Read-heavy, write-light, and almost all operations are whole-file sequential reads or writes</li>
<li>A small amount of data corruption has no serious consequences</li>
</ul>
</section>
<section id="mirrors-file-distrib">
<div class="img-container">
<img src="https://image.ibugone.com/server/mirrors-file-size-distribution-2024-08.png" />
<p>File size distribution in the USTC mirror repositories, August 2024
<br />
Median: 9.83 KiB, mean: 1.60 MiB</p>
</div>
</section>
<section id="mirrors2">
<h3>Rebuilding the Rsync server</h3>
<ul>
<li>RAID-Z3 has a high overhead, and splitting into two RAID-Z2 groups = twice the IOPS</li>
<li>The mirror tuning plan (see the sketch after this list):
<ul>
<li><code>recordsize=1M</code>: everything is whole-file sequential reads anyway</li>
<li><code>compression=zstd</code>: at the very least it squeezes out the padding of files > 1M
<ul>
<li>OpenZFS 2.2 extended the early-abort mechanism to Zstd 3+, so performance is no longer a concern</li>
</ul>
</li>
<li><code>xattr=off</code>: what mirror needs xattrs?</li>
<li><code>atime=off</code>, <code>setuid=off</code>, <code>exec=off</code>, <code>devices=off</code>: why keep these on?</li>
<li><code>secondarycache=metadata</code>: Rsync doesn't get to burn SSD endurance</li>
</ul>
</li>
<li>Danger Zone:
<ul>
<li><code>sync=disabled</code>: hold writes until <code>zfs_txg_timeout</code> before hitting the disks</li>
<li><code>redundant_metadata=some</code>: losing the occasional file is no big deal</li>
</ul>
</li>
<li>Full version: <a href="https://docs.ustclug.org/services/mirrors/zfs/#setup">LUG @ USTC Documentation</a></li>
</ul>
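<p>A rough sketch of what the dataset creation looks like with these properties (the pool/dataset name is made up here; see the linked documentation for the real setup):</p>
<pre><code class="language-sh" data-trim>
# Illustrative only; danger-zone settings are applied separately
zfs create -o recordsize=1M -o compression=zstd \
    -o xattr=off -o atime=off -o setuid=off -o exec=off -o devices=off \
    -o secondarycache=metadata pool0/repo
</code></pre>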
</section>
<section id="zfs-parameters">
<h3>ZFS module parameters</h3>
<ul>
<li>You can't study all 290+ parameters one by one (thanks to Aron Xu @ BFSU)</li>
<li>ARC size:
<pre><code class="language-sh" data-trim>
# Set ARC size to 160-200 GiB, keep 16 GiB free for OS
options zfs zfs_arc_max=214748364800
options zfs zfs_arc_min=171798691840
options zfs zfs_arc_sys_free=17179869184
</code></pre>
</li>
<li>ARC contents:
<pre><code class="language-sh" data-trim>
# Favor metadata to data by 20x (OpenZFS 2.2+)
options zfs zfs_arc_meta_balance=2000
# Allow up to 80% of ARC to be used for dnodes
options zfs zfs_arc_dnode_limit_percent=80
</code></pre>
</li>
<li>I/O queue depth:
<pre><code class="language-sh" data-trim>
# See man page section "ZFS I/O Scheduler"
options zfs zfs_vdev_async_read_max_active=8
options zfs zfs_vdev_async_read_min_active=2
options zfs zfs_vdev_scrub_max_active=5
options zfs zfs_vdev_max_active=20000
</code></pre>
</li>
<li>Full version: <a href="https://docs.ustclug.org/services/mirrors/zfs/#zfs-kernel-module">LUG @ USTC Documentation</a></li>
</ul>
</section>
<section id="mirrors2-rebuild-results">
<h3>Rebuild results</h3>
<ul>
<li>A pleasantly surprising compression ratio: 39.5T / 37.1T = 1.07x
<ul>
<li>The right way to measure it: <code>zfs list -po name,logicalused,used</code></li>
<li>Actual savings: 1 + 6.57% (-2.67 TB / -2.43 TiB)</li>
<li><s>Equivalent to deleting <a href="https://image.ibugone.com/teaser/lenovo-legion-wechat-data.jpg">9 copies of WeChat data</a></s></li>
</ul>
</li>
<li>Reasonable disk I/O</li>
</ul>
</section>
<section id="mirrors2-io-image">
<div class="img-container">
<img src="https://image.ibugone.com/grafana/mirrors2-io-utilization-and-free-space-june-july-2024.png" />
<p>Disk load and free space of the Rsync server before and after the rebuild</p>
</div>
</section>
</section>
<section>
<section id="mirrors4">
<h2>The HTTP server</h2>
<ul>
<li>Hardware RAID + LVM + XFS + kernel page cache (works out of the box?)</li>
<li>SSD? LVMcache!
<ul>
<li>1M extents? Block size? Algorithm?</li>
<li><i class="fas fa-skull"></i> GRUB2</li>
<li><i class="fas fa-skull"></i> "oldssd"</li>
</ul>
</li>
<li>XFS cannot shrink, so spare space has to be reserved at both the VG and FS layers</li>
</ul>
</section>
<section id="mirrors4-dmcache-image">
<div class="img-container">
<img src="https://image.ibugone.com/grafana/mirrors4-dmcache-may-june-2024.png" />
<p>Hit rate of the LVMcache setup used by the HTTP server before the rebuild</p>
</div>
</section>
<section id="mirrors4-rebuild">
<h2>Same recipe again</h2>
<ul>
<li>Trying out a more modern kernel: <code>6.8.8-3-pve</code> (no DKMS needed!)</li>
<li>Rebuilt as two RAID-Z2 groups, with compression on
<ul>
<li>This server faces HTTP users, so <code>secondarycache=all</code> (left alone)</li>
<li>A better CPU, so <code>compression=zstd-8</code></li>
</ul>
</li>
<li>Faster <code>zfs send -Lcp</code>: 50+ TiB of repositories copied over in 36 hours</li>
<li>Compression savings: 1 + 3.93% (-2.42 TB / -2.20 TiB)</li>
</ul>
</section>
<section id="mirrors2-4-io-image">
<div class="img-container">
<img src="https://image.ibugone.com/grafana/mirrors2-4-io-utilization-june-july-2024.png" />
<p>Disk load of the two servers before and after the rebuild
<br />
Left: before the rebuild; middle: after only the Rsync server was rebuilt; right: after both servers were rebuilt</p>
</div>
</section>
<section id="mirrors2-4-zfs-arc-image">
<div class="img-container">
<img src="https://image.ibugone.com/grafana/mirrors2-4-zfs-arc-hit-rate.png" />
<p>ZFS ARC hit rates of the two servers</p>
</div>
</section>
<section id="mirrors2-4-recent-io-image">
<div class="img-container">
<img src="https://image.ibugone.com/grafana/mirrors2-4-disk-io-after-rebuild.png" />
<p>Steady disk utilization of the two servers after the rebuild</p>
</div>
</section>
</section>
<section>
<section id="misc">
<h2>Miscellaneous</h2>
</section>
<section id="zfs-compressratio">
<h3>ZFS compression ratio</h3>
<table>
<thead>
<tr>
<th>NAME</th>
<th>LUSED</th>
<th>USED</th>
<th>RATIO</th>
</tr>
</thead>
<tbody>
<tr>
<td>pool0/repo/crates.io-index</td>
<td>2.19G</td>
<td>1.65G</td>
<td>3.01x</td>
</tr>
<tr>
<td>pool0/repo/elpa</td>
<td>3.35G</td>
<td>2.32G</td>
<td>1.67x</td>
</tr>
<tr>
<td>pool0/repo/rfc</td>
<td>4.37G</td>
<td>3.01G</td>
<td>1.56x</td>
</tr>
<tr>
<td>pool0/repo/debian-cdimage</td>
<td>1.58T</td>
<td>1.04T</td>
<td>1.54x</td>
</tr>
<tr>
<td>pool0/repo/tldp</td>
<td>4.89G</td>
<td>3.78G</td>
<td>1.48x</td>
</tr>
<tr>
<td>pool0/repo/loongnix</td>
<td>438G</td>
<td>332G</td>
<td>1.34x</td>
</tr>
<tr>
<td>pool0/repo/rosdistro</td>
<td>32.2M</td>
<td>26.6M</td>
<td>1.31x</td>
</tr>
</tbody>
</table>
<p>My math must be off: <a href="https://github.com/openzfs/zfs/issues/7639"><i class="fab fa-github"></i> openzfs/zfs#7639</a></p>
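<p>A rough one-liner sketch for comparing the naive LUSED/USED ratio against the <code>compressratio</code> property (dataset name borrowed from the tables above):</p>
<pre><code class="language-sh" data-trim>
# -p gives exact byte counts, so the division below is meaningful
zfs list -Hpo name,logicalused,used,compressratio -r pool0/repo |
    awk '{ printf "%s %.2fx (compressratio=%s)\n", $1, $2/$3, $4 }'
</code></pre>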
</section>
<section id="zfs-compressratio-diff">
<h3>ZFS compression savings</h3>
<table>
<thead>
<tr>
<th>NAME</th>
<th>LUSED</th>
<th>USED</th>
<th>DIFF</th>
</tr>
</thead>
<tbody>
<tr>
<td>pool0/repo</td>
<td>58.3T</td>
<td>56.1T</td>
<td>2.2T</td>
</tr>
<tr>
<td>pool0/repo/debian-cdimage</td>
<td>1.6T</td>
<td>1.0T</td>
<td>549.6G</td>
</tr>
<tr>
<td>pool0/repo/opensuse</td>
<td>2.5T</td>
<td>2.3T</td>
<td>279.7G</td>
</tr>
<tr>
<td>pool0/repo/turnkeylinux</td>
<td>1.2T</td>
<td>1.0T</td>
<td>155.2G</td>
</tr>
<tr>
<td>pool0/repo/loongnix</td>
<td>438.2G</td>
<td>331.9G</td>
<td>106.3G</td>
</tr>
<tr>
<td>pool0/repo/alpine</td>
<td>3.0T</td>
<td>2.9T</td>
<td>103.9G</td>
</tr>
<tr>
<td>pool0/repo/openwrt</td>
<td>1.8T</td>
<td>1.7T</td>
<td>70.0G</td>
</tr>
</tbody>
</table>
</section>
<section id="grafana-zfs-io">
<h3>Grafana I/O statistics</h3>
<pre><code data-trim>
SELECT
non_negative_derivative(sum("reads"), 1s) AS "read",
non_negative_derivative(sum("writes"), 1s) AS "write"
FROM (
SELECT
first("reads") AS "reads",
first("writes") AS "writes"
FROM "zfs_pool"
WHERE ("host" = 'taokystrong' AND "pool" = 'pool0') AND $timeFilter
GROUP BY time($interval), "host"::tag, "pool"::tag, "dataset"::tag fill(null)
)
WHERE $timeFilter
GROUP BY time($interval), "pool"::tag fill(linear)
</code></pre>
<p>Runs a bit slowly (after all, it has to <code>GROUP BY</code> every ZFS dataset first and then <code>sum</code> them together)</p>
<p>For I/O bandwidth, just replace the inner <code>reads</code> and <code>writes</code> with <code>nread</code> and <code>nwritten</code></p>
</section>
<section id="grafana-zfs-io-image">
<div class="img-container">
<img src="https://image.ibugone.com/grafana/mirrors2-4-zfs-io-count.png" />
</div>
<p></p>
<ul>
<li>How do you get an average of 15K and a peak of 50K IOPS out of spinning disks?</li>
<li><s>By counting ARC hits</s></li>
</ul>
</section>
</section>
<section>
<section id="hearse">
<h2>The cursed stuff</h2>
</section>
<section id="pve-kernel">
<h2>Proxmox Kernel</h2>
<ul>
<li>≈ Ubuntu Kernel</li>
<li><i class="fas fa-skull"></i> The Rsync container</li>
<li><code>security/apparmor/af_unix.c</code>???</li>
<li><a href="https://docs.ustclug.org/faq/apparmor/">LUG Documentation: AppArmor</a></li>
</ul>
<pre><code class="language-sh" data-trim>
dpkg-divert --package lxc-pve --rename --divert /usr/share/apparmor-features/features.stock --add /usr/share/apparmor-features/features
wget -O /usr/share/apparmor-features/features https://github.com/proxmox/lxc/raw/master/debian/features
</code></pre>
</section>
<section id="zerotier-data">
<div class="img-container">
<img src="https://image.ibugone.com/server/ls-zerotier-redhar-el.png" />
<p>Obviously duplicated content in the ZeroTier repository</p>
</div>
</section>
<section id="dedup">
<h3>Dedup!</h3>
<pre><code class="language-sh">zfs create -o dedup=on pool0/repo/zerotier</code></pre>
<pre><code class="language-sh" data-trim>
# zdb -DDD pool0
dedup = 4.93, compress = 1.23, copies = 1.00, dedup * compress / copies = 6.04
</code></pre>
<p>Works well enough, but what if you don't want something as cursed as ZFS dedup?</p>
</section>
<section id="jdupes">
<h3>jdupes</h3>
<pre><code class="language-sh" data-trim>
# post-sync.sh
# Do file-level deduplication for select repos
case "$NAME" in
docker-ce|influxdata|nginx|openresty|proxmox|salt|tailscale|zerotier)
jdupes -L -Q -r -q "$DIR" ;;
esac
</code></pre>
</section>
<section id="jdupes-table">
<h3>jdupes results</h3>
<table>
<thead>
<tr>
<th>Name</th>
<th>Orig</th>
<th>Dedup</th>
<th>Diff</th>
<th>Ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>proxmox</td>
<td>395.4G</td>
<td>162.6G</td>
<td>232.9G</td>
<td>2.43x</td>
</tr>
<tr>
<td>docker-ce</td>
<td>539.6G</td>
<td>318.2G</td>
<td>221.4G</td>
<td>1.70x</td>
</tr>
<tr>
<td>influxdata</td>
<td>248.4G</td>
<td>54.8G</td>
<td>193.6G</td>
<td>4.54x</td>
</tr>
<tr>
<td>salt</td>
<td>139.0G</td>
<td>87.2G</td>
<td>51.9G</td>
<td>1.59x</td>
</tr>
<tr>
<td>nginx</td>
<td>94.9G</td>
<td>59.7G</td>
<td>35.2G</td>
<td>1.59x</td>
</tr>
<tr>
<td>zerotier</td>
<td>29.8G</td>
<td>6.1G</td>
<td>23.7G</td>
<td>4.88x</td>
</tr>
<tr>
<td>mysql-repo</td>
<td>647.8G</td>
<td>632.5G</td>
<td>15.2G</td>
<td>1.02x</td>
</tr>
<tr>
<td>openresty</td>
<td>65.1G</td>
<td>53.4G</td>
<td>11.7G</td>
<td>1.22x</td>
</tr>
<tr>
<td>tailscale</td>
<td>17.9G</td>
<td>9.0G</td>
<td>9.0G</td>
<td>2.00x</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="conclusion">
<h2>As long as you use ZFS well</h2>
<ul>
<li>Mom no longer needs to worry about my disk partitioning</li>
<li>HDDs <s>run even faster than those Western SSDs</s></li>
<li>Became the first mirror site that no longer <b>envies</b> TUNA's all-flash setup</li>
<li>Free extra capacity
<ul>
<li>A bonus red packet from dedup</li>
</ul>
</li>
<li>Fragmentation?</li>
</ul>
</section>
<section id="outro">
<h1>Thank you!</h1>
<small>
<p>Link to this page: <a href="/p/72"><i class="fas fa-fw fa-link"></i> ibug.io/p/72</a></p>
<p>See also: the 2023 Nanjing University talk: <a href="/p/59"><i class="fas fa-fw fa-link"></i> ibug.io/p/59</a></p>
</small>
</section>
]]></content><author><name>iBug</name></author></entry><entry><title type="html">Why my IPv4 gets stuck? - Debugging network issues with bpftrace</title><link href="https://ibug.io/blog/2024/08/first-touch-bpftrace/" rel="alternate" type="text/html" title="Why my IPv4 gets stuck? - Debugging network issues with bpftrace" /><published>2024-08-03T00:00:00+00:00</published><updated>2024-08-03T03:22:32+00:00</updated><id>https://ibug.io/blog/2024/08/first-touch-bpftrace</id><content type="html" xml:base="https://ibug.io/blog/2024/08/first-touch-bpftrace/"><![CDATA[<p>I run a Debian-based software router on my home network. It’s connected to multiple ISPs, so I have some policy routing rules to balance the traffic between them. Some time ago, I noticed that the IPv4 connectivity got stuck intermittently when it didn’t use to, while IPv6 was working fine. It’s also interesting that the issue only happened with one specific ISP, in the egress direction, and only a few specific devices were affected.</p>
<p>At first I suspected the ISP’s equipment, but a clue quickly dismissed that suspicion: connections to the same ISP worked fine when initiated from the router itself, as well as from many other unaffected devices. So the issue must be within the router.</p>
<p>As usual, every network debugging session begins with a packet capture. I start <code class="language-plaintext highlighter-rouge">tcpdump</code> on both the LAN interface and the problematic WAN interface, then try <code class="language-plaintext highlighter-rouge">curl</code>-ing something from an affected device. The capture shows a few back-and-forth packets, then the device keeps sending packets but the router doesn’t forward them to the WAN interface anymore. Time for a closer look.</p>
<h2 id="identifying-the-issue">Identifying the issue</h2>
<p>On an affected device, <code class="language-plaintext highlighter-rouge">curl</code> gets stuck somewhere in the middle:</p>
<div class="language-console highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>curl <span class="nt">-vso</span> /dev/null https://www.cloudflare.com/
<span class="go">* Trying 104.16.124.96:443...
</span><span class="gp">* Connected to www.cloudflare.com (104.16.124.96) port 443 (#</span>0<span class="o">)</span>
<span class="go">* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
^C
</span></code></pre>
</div>
</div>
<p><code class="language-plaintext highlighter-rouge">tcpdump</code> shows nothing special:</p>
<div class="language-console highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>tcpdump <span class="nt">-ni</span> any <span class="s1">'host 104.16.124.96 and tcp port 443'</span>
<span class="gp">02:03:47.403905 lan0 In IP 172.17.0.2.49194 ></span><span class="w"> </span>104.16.124.96.443: Flags <span class="o">[</span>S], <span class="nb">seq </span>1854398703, win 65535, options <span class="o">[</span>mss 1460,sackOK,TS val 1651776756 ecr 0,nop,wscale 10], length 0
<span class="gp">02:03:47.403956 ppp0 Out IP 10.250.193.4.49194 ></span><span class="w"> </span>104.16.124.96.443: Flags <span class="o">[</span>S], <span class="nb">seq </span>1854398703, win 65535, options <span class="o">[</span>mss 1432,sackOK,TS val 1651776756 ecr 0,nop,wscale 10], length 0
<span class="gp">02:03:47.447663 ppp0 In IP 104.16.124.96.443 ></span><span class="w"> </span>10.250.193.4.49194: Flags <span class="o">[</span>S.], <span class="nb">seq </span>1391350792, ack 1854398704, win 65535, options <span class="o">[</span>mss 1460,sackOK,TS val 141787839 ecr 1651776756,nop,wscale 13], length 0
<span class="gp">02:03:47.447696 lan0 Out IP 104.16.124.96.443 ></span><span class="w"> </span>172.17.0.2.49194: Flags <span class="o">[</span>S.], <span class="nb">seq </span>1391350792, ack 1854398704, win 65535, options <span class="o">[</span>mss 1460,sackOK,TS val 141787839 ecr 1651776756,nop,wscale 13], length 0
<span class="gp">02:03:47.447720 lan0 In IP 172.17.0.2.49194 ></span><span class="w"> </span>104.16.124.96.443: Flags <span class="o">[</span>.], ack 1, win 64, options <span class="o">[</span>nop,nop,TS val 1651776800 ecr 141787839], length 0
<span class="gp">02:03:47.452705 lan0 In IP 172.17.0.2.49194 ></span><span class="w"> </span>104.16.124.96.443: Flags <span class="o">[</span>P.], <span class="nb">seq </span>1:518, ack 1, win 64, options <span class="o">[</span>nop,nop,TS val 1651776804 ecr 141787839], length 517
<span class="gp">02:03:47.452751 ppp0 Out IP 10.250.193.4.49194 ></span><span class="w"> </span>104.16.124.96.443: Flags <span class="o">[</span>P.], <span class="nb">seq </span>1:518, ack 1, win 64, options <span class="o">[</span>nop,nop,TS val 1651776804 ecr 141787839], length 517
<span class="gp">02:03:47.496507 ppp0 In IP 104.16.124.96.443 ></span><span class="w"> </span>10.250.193.4.49194: Flags <span class="o">[</span>.], ack 518, win 9, options <span class="o">[</span>nop,nop,TS val 141787888 ecr 1651776804], length 0
<span class="gp">02:03:47.496527 lan0 Out IP 104.16.124.96.443 ></span><span class="w"> </span>172.17.0.2.49194: Flags <span class="o">[</span>.], ack 518, win 9, options <span class="o">[</span>nop,nop,TS val 141787888 ecr 1651776804], length 0
<span class="gp">02:03:47.498147 ppp0 In IP 104.16.124.96.443 ></span><span class="w"> </span>10.250.193.4.49194: Flags <span class="o">[</span>P.], <span class="nb">seq </span>1:2737, ack 518, win 9, options <span class="o">[</span>nop,nop,TS val 141787890 ecr 1651776804], length 2736
<span class="gp">02:03:47.498165 lan0 Out IP 104.16.124.96.443 ></span><span class="w"> </span>172.17.0.2.49194: Flags <span class="o">[</span>P.], <span class="nb">seq </span>1:2737, ack 518, win 9, options <span class="o">[</span>nop,nop,TS val 141787890 ecr 1651776804], length 2736
<span class="gp">02:03:47.498175 lan0 In IP 172.17.0.2.49194 ></span><span class="w"> </span>104.16.124.96.443: Flags <span class="o">[</span>.], ack 2737, win 70, options <span class="o">[</span>nop,nop,TS val 1651776850 ecr 141787890], length 0
<span class="gp">02:03:47.498195 ppp0 In IP 104.16.124.96.443 ></span><span class="w"> </span>10.250.193.4.49194: Flags <span class="o">[</span>P.], <span class="nb">seq </span>2737:3758, ack 518, win 9, options <span class="o">[</span>nop,nop,TS val 141787890 ecr 1651776804], length 1021
<span class="gp">02:03:47.498228 ppp0 Out IP 10.250.193.4.49194 ></span><span class="w"> </span>104.16.124.96.443: Flags <span class="o">[</span>R], <span class="nb">seq </span>1854399221, win 0, length 0
<span class="go">^C
711 packets captured
720 packets received by filter
0 packets dropped by kernel
</span></code></pre>
</div>
</div>
<p>Considering the complexity of the policy routing, I tried inspecting conntrack status in parallel. Nothing unusual there either, until I tried matching conntrack events with <code class="language-plaintext highlighter-rouge">tcpdump</code>:</p>
<div class="language-console highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>conntrack <span class="nt">-E</span> <span class="nt">-s</span> 172.17.0.2 <span class="nt">-p</span> tcp <span class="nt">--dport</span> 443 2>/dev/null | ts %.T
<span class="go">02:03:47.404103 [NEW] tcp 6 120 SYN_SENT src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 [UNREPLIED] src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194
02:03:47.447748 [UPDATE] tcp 6 60 SYN_RECV src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 mark=48
02:03:47.447843 [DESTROY] tcp 6 432000 ESTABLISHED src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 [ASSURED] mark=48
02:03:47.452798 [NEW] tcp 6 300 ESTABLISHED src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 [UNREPLIED] src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194
02:03:47.496572 [UPDATE] tcp 6 300 src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 mark=48
02:03:47.498195 [UPDATE] tcp 6 300 src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 [ASSURED] mark=48
02:03:47.498243 [DESTROY] tcp 6 432000 ESTABLISHED src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 [ASSURED] mark=48
^C
</span></code></pre>
</div>
</div>
<p>With <code class="language-plaintext highlighter-rouge">ts</code> (from <a href="https://packages.debian.org/stable/moreutils"><code class="language-plaintext highlighter-rouge">moreutils</code></a>) adding timestamps to conntrack events, I can see that the conntrack entry is destroyed right after (+123μs) the second packet comes in from the device. Subsequent packets cause (+93μs) the same conntrack entry to be recreated, which explains how <code class="language-plaintext highlighter-rouge">curl</code> could somehow get the SSL handshake to a point where it sends only one more packet and nothing afterwards, and the connection gets recreated for a third time.</p>
<p>Clearly the second packet should be considered <code class="language-plaintext highlighter-rouge">ESTABLISHED</code> by conntrack, so it makes no sense for it to trigger a <code class="language-plaintext highlighter-rouge">DESTROY</code> event. I’m at a loss here and start trying random things, hoping to find a clue. I tried downgrading the kernel to 5.10 (from Bullseye) and upgrading to 6.9 (from Bookworm backports), but nothing changed, eliminating the possibility of a kernel bug.</p>
<p>After scrutinizing my firewall rules, I noticed a small difference between IPv4 and IPv6 rules:</p>
<div class="language-shell highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c"># rules.v4</span>
<span class="k">*</span>nat
<span class="c"># ...</span>
<span class="nt">-A</span> POSTROUTING <span class="nt">-o</span> ppp+ <span class="nt">-j</span> MASQUERADE
COMMIT
<span class="k">*</span>mangle
:PREROUTING ACCEPT <span class="o">[</span>0:0]
<span class="c"># ...</span>
<span class="nt">-A</span> PREROUTING <span class="nt">-j</span> CONNMARK <span class="nt">--restore-mark</span>
<span class="nt">-A</span> PREROUTING <span class="nt">-m</span> mark <span class="o">!</span> <span class="nt">--mark</span> 0 <span class="nt">-j</span> ACCEPT
<span class="c">#A PREROUTING -m conntrack --ctstate NEW,RELATED -j MARK --set-xmark 0x100/0x100</span>
<span class="nt">-A</span> PREROUTING <span class="nt">-m</span> mark <span class="nt">--mark</span> 0/0xff <span class="nt">-j</span> ExtraConn
<span class="nt">-A</span> PREROUTING <span class="nt">-m</span> mark <span class="nt">--mark</span> 0/0xff <span class="nt">-j</span> IntraConn
<span class="nt">-A</span> PREROUTING <span class="nt">-m</span> mark <span class="nt">--mark</span> 0/0xff <span class="nt">-j</span> MARK <span class="nt">--set-xmark</span> 0x30/0xff
<span class="nt">-A</span> PREROUTING <span class="nt">-j</span> CONNMARK <span class="nt">--save-mark</span>
<span class="c"># ...</span>
<span class="nt">-A</span> ExtraConn <span class="nt">-i</span> ppp0 <span class="nt">-j</span> MARK <span class="nt">--set-xmark</span> 0x30/0xff
<span class="c"># ...</span>
<span class="nt">-A</span> IntraConn <span class="nt">-s</span> 172.17.0.2/32 <span class="nt">-j</span> iBugOptimized
<span class="c"># ...</span>
<span class="nt">-A</span> iBugOptimized <span class="nt">-j</span> MARK <span class="nt">--set-xmark</span> 0x36/0xff
<span class="nt">-A</span> iBugOptimized <span class="nt">-j</span> ACCEPT
COMMIT
</code></pre>
</div>
</div>
<p>However, <code class="language-plaintext highlighter-rouge">rules.v6</code> is missing the last rule in <code class="language-plaintext highlighter-rouge">iBugOptimized</code>, and IPv6 is somehow exempt from the conntrack issue. Removing this extra <code class="language-plaintext highlighter-rouge">ACCEPT</code> rule from <code class="language-plaintext highlighter-rouge">rules.v4</code> fully restores the connectivity. So this is certainly the cause, but how is it related to the actual issue?</p>
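<p>For the record, the immediate fix is a one-liner along these lines, with the chain name taken from the <code class="language-plaintext highlighter-rouge">rules.v4</code> excerpt above (and the same line removed from the saved rules so it persists across reloads):</p>
<div class="language-shell highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code># Drop the extra ACCEPT from the mangle table at runtime
iptables -t mangle -D iBugOptimized -j ACCEPT
</code></pre>
</div>
</div>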
<h2 id="investigation">Investigation</h2>
<p><em>I know there are some decent tools on GitHub that aid in debugging iptables, which is notorious for its complexity. But since I wrote the entire firewall rule set and am still maintaining it by hand, I’m taking the hard route of watching and understanding every single rule.</em></p>
<p>The difference for that single <code class="language-plaintext highlighter-rouge">ACCEPT</code> rule is, it skips the <code class="language-plaintext highlighter-rouge">--save-mark</code> step, so the assigned firewall mark is not saved to its corresponding conntrack entry. When a reply packet comes in, conntrack has nothing for the <code class="language-plaintext highlighter-rouge">--restore-mark</code> step, so the packet gets assigned the “default” mark of <code class="language-plaintext highlighter-rouge">0x30</code> and <em>then</em> this value gets saved. I should have noticed the wrong conntrack mark earlier, as <code class="language-plaintext highlighter-rouge">conntrack -L</code> clearly showed a mark of 48 instead of the intended 54 (<code class="language-plaintext highlighter-rouge">0x36</code> from <code class="language-plaintext highlighter-rouge">iBugOptimized</code>). This narrows the cause down to a discrepancy between the packet mark and the conntrack mark.</p>
<p>Firewall marks are a more flexible way to implement slightly complicated policy-based routing, as it defers the routing decision to the <code class="language-plaintext highlighter-rouge">mangle/PREROUTING</code> chain instead of the single-chain global routing rules. In my case, every ISP gets assigned a fwmark routing rule like this:</p>
<div class="language-text highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>9: from all fwmark 0x30/0xff lookup eth0 proto static
9: from all fwmark 0x31/0xff lookup eth1 proto static
9: from all fwmark 0x36/0xff lookup ppp0 proto static
</code></pre>
</div>
</div>
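<p>(For context, rules like these are typically installed with <code class="language-plaintext highlighter-rouge">ip rule</code>, roughly as below; the named routing tables are assumed to be declared in <code class="language-plaintext highlighter-rouge">/etc/iproute2/rt_tables</code>:)</p>
<div class="language-shell highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code># One fwmark rule plus one routing table per ISP (illustrative sketch)
ip rule add priority 9 fwmark 0x30/0xff table eth0
ip rule add priority 9 fwmark 0x36/0xff table ppp0
ip route add default dev ppp0 table ppp0
</code></pre>
</div>
</div>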
<p>Presumably, subsequent packets from the same connection should be routed to <code class="language-plaintext highlighter-rouge">eth0</code> because it has the mark <code class="language-plaintext highlighter-rouge">0x30</code> restored from conntrack entry. This is not the case, however, as <code class="language-plaintext highlighter-rouge">tcpdump</code> shows nothing on <code class="language-plaintext highlighter-rouge">eth0</code> and everything on <code class="language-plaintext highlighter-rouge">ppp0</code>.</p>
<p>Unless there’s some magic in the kernel for it to decide to destroy a connection simply for a packet mark mismatch, this is not close enough to the root cause. Verifying the magic is relatively easy:</p>
<div class="language-shell highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>iptables <span class="nt">-I</span> PREROUTING <span class="nt">-s</span> 172.17.0.2/32 <span class="nt">-j</span> Test
iptables <span class="nt">-A</span> Test <span class="nt">-m</span> conntrack <span class="nt">--ctstate</span> NEW <span class="nt">-j</span> MARK <span class="nt">--set-xmark</span> 0x36/0xff
iptables <span class="nt">-A</span> Test <span class="nt">-m</span> conntrack <span class="nt">--ctstate</span> ESTABLISHED <span class="nt">-j</span> MARK <span class="nt">--set-xmark</span> 0x30/0xff
</code></pre>
</div>
</div>
<p>This time, even though <code class="language-plaintext highlighter-rouge">conntrack</code> shows no mark (i.e. zero) on the connection, the packets are still routed correctly to <code class="language-plaintext highlighter-rouge">ppp0</code>, and curl gets stuck at the same place as before. So the kernel doesn’t care about the conntrack mark at all.</p>
<p>Unfortunately, this is about as far as userspace inspection can go. I need to find out why exactly the kernel decides to destroy the conntrack entry.</p>
<h2 id="bpftrace-comes-in"><code class="language-plaintext highlighter-rouge">bpftrace</code> comes in</h2>
<p>I’ve seen professional kernel network developers extensively running <code class="language-plaintext highlighter-rouge">bpftrace</code> to debug network issues (THANK YOU to the guy behind the Telegram channel <em>Welcome to the Black Parade</em>), so I’m giving it a try.</p>
<p>The first thing is to figure out what to hook. Searching through Google did not reveal a tracepoint for conntrack events, but I did get to know the conntrack code path. With help from ChatGPT, I begin with <code class="language-plaintext highlighter-rouge">kprobe:nf_ct_delete</code> and put together all the struct definitions starting from <code class="language-plaintext highlighter-rouge">struct nf_conn</code>:</p>
<div class="language-c highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="cp">#include</span> <span class="cpf"><linux/socket.h></span><span class="cp">
#include</span> <span class="cpf"><net/netfilter/nf_conntrack.h></span><span class="cp">
</span>
<span class="n">kprobe</span><span class="o">:</span><span class="n">nf_ct_delete</span>
<span class="p">{</span>
<span class="c1">// The first argument is the struct nf_conn</span>
<span class="err">$</span><span class="n">ct</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">nf_conn</span> <span class="o">*</span><span class="p">)</span><span class="n">arg0</span><span class="p">;</span>
<span class="c1">// Check if the connection is for IPv4</span>
<span class="k">if</span> <span class="p">(</span><span class="err">$</span><span class="n">ct</span><span class="o">-></span><span class="n">tuplehash</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">tuple</span><span class="p">.</span><span class="n">src</span><span class="p">.</span><span class="n">l3num</span> <span class="o">==</span> <span class="n">AF_INET</span><span class="p">)</span> <span class="p">{</span>
<span class="err">$</span><span class="n">src_ip</span> <span class="o">=</span> <span class="err">$</span><span class="n">ct</span><span class="o">-></span><span class="n">tuplehash</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">tuple</span><span class="p">.</span><span class="n">src</span><span class="p">.</span><span class="n">u3</span><span class="p">.</span><span class="n">ip</span><span class="p">;</span>
<span class="err">$</span><span class="n">dst_ip</span> <span class="o">=</span> <span class="err">$</span><span class="n">ct</span><span class="o">-></span><span class="n">tuplehash</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">tuple</span><span class="p">.</span><span class="n">dst</span><span class="p">.</span><span class="n">u3</span><span class="p">.</span><span class="n">ip</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"Conntrack destroyed (IPv4): src=%s dst=%s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">ntop</span><span class="p">(</span><span class="err">$</span><span class="n">src_ip</span><span class="p">),</span> <span class="n">ntop</span><span class="p">(</span><span class="err">$</span><span class="n">dst_ip</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre>
</div>
</div>
<p>Seems all good, except it won’t compile:</p>
<div class="language-text highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>ERROR: Can not access field 'u3' on expression of type 'none'
$dst_ip = $ct->tuplehash[0].tuple.dst.u3.ip;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
</code></pre>
</div>
</div>
<p>After another half-hour of struggling and bothering with ChatGPT, I gave up trying to access the destination tuple, and thought I’d be fine with inspecting the stack trace:</p>
<div class="language-c highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="cp">#include</span> <span class="cpf"><linux/socket.h></span><span class="cp">
#include</span> <span class="cpf"><net/netfilter/nf_conntrack.h></span><span class="cp">
</span>
<span class="n">kprobe</span><span class="o">:</span><span class="n">nf_ct_delete</span>
<span class="p">{</span>
<span class="c1">// The first argument is the struct nf_conn</span>
<span class="err">$</span><span class="n">ct</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">nf_conn</span> <span class="o">*</span><span class="p">)</span><span class="n">arg0</span><span class="p">;</span>
<span class="c1">// Check if the connection is for IPv4</span>
<span class="k">if</span> <span class="p">(</span><span class="err">$</span><span class="n">ct</span><span class="o">-></span><span class="n">tuplehash</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">tuple</span><span class="p">.</span><span class="n">src</span><span class="p">.</span><span class="n">l3num</span> <span class="o">==</span> <span class="n">AF_INET</span><span class="p">)</span> <span class="p">{</span>
<span class="err">$</span><span class="n">tuple_orig</span> <span class="o">=</span> <span class="err">$</span><span class="n">ct</span><span class="o">-></span><span class="n">tuplehash</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">tuple</span><span class="p">;</span>
<span class="err">$</span><span class="n">src_ip</span> <span class="o">=</span> <span class="err">$</span><span class="n">tuple_orig</span><span class="p">.</span><span class="n">src</span><span class="p">.</span><span class="n">u3</span><span class="p">.</span><span class="n">ip</span><span class="p">;</span>
<span class="err">$</span><span class="n">src_port_n</span> <span class="o">=</span> <span class="err">$</span><span class="n">tuple_orig</span><span class="p">.</span><span class="n">src</span><span class="p">.</span><span class="n">u</span><span class="p">.</span><span class="n">all</span><span class="p">;</span>
<span class="err">$</span><span class="n">src_port</span> <span class="o">=</span> <span class="p">(</span><span class="err">$</span><span class="n">src_port_n</span> <span class="o">>></span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="p">((</span><span class="err">$</span><span class="n">src_port_n</span> <span class="o"><<</span> <span class="mi">8</span><span class="p">)</span> <span class="o">&</span> <span class="mh">0x00FF00</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="err">$</span><span class="n">src_ip</span> <span class="o">!=</span> <span class="mh">0x020011ac</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="err">$</span><span class="n">mark</span> <span class="o">=</span> <span class="err">$</span><span class="n">ct</span><span class="o">-></span><span class="n">mark</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"Conntrack destroyed (IPv4): src=%s sport=%d mark=%d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">ntop</span><span class="p">(</span><span class="err">$</span><span class="n">src_ip</span><span class="p">),</span> <span class="err">$</span><span class="n">src_port</span><span class="p">,</span> <span class="err">$</span><span class="n">mark</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">kstack</span><span class="p">());</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre>
</div>
</div>
<p>One noteworthy point is that I have to filter the connections inside the program, otherwise my screen gets flooded with unrelated events.</p>
<p>The output looks promising:</p>
<div class="language-text highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>Attaching 1 probe...
Conntrack destroyed (IPv4): src=172.17.0.2 sport=39456 mark=0 proto=6
nf_ct_delete+1
nf_nat_inet_fn+188
nf_nat_ipv4_out+80
nf_hook_slow+70
ip_output+220
ip_forward_finish+132
ip_forward+1296
ip_rcv+404
__netif_receive_skb_one_core+145
__netif_receive_skb+21
netif_receive_skb+300
...
</code></pre>
</div>
</div>
<p>Reading the source code from the top few functions of the call stack:</p>
<div class="language-c highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">// net/netfilter/nf_nat_proto.c</span>
<span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">int</span>
<span class="nf">nf_nat_ipv4_out</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">priv</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
<span class="k">const</span> <span class="k">struct</span> <span class="n">nf_hook_state</span> <span class="o">*</span><span class="n">state</span><span class="p">)</span>
<span class="p">{</span>
<span class="cp">#ifdef CONFIG_XFRM
</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">nf_conn</span> <span class="o">*</span><span class="n">ct</span><span class="p">;</span>
<span class="k">enum</span> <span class="n">ip_conntrack_info</span> <span class="n">ctinfo</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">err</span><span class="p">;</span>
<span class="cp">#endif
</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">ret</span><span class="p">;</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">nf_nat_ipv4_fn</span><span class="p">(</span><span class="n">priv</span><span class="p">,</span> <span class="n">skb</span><span class="p">,</span> <span class="n">state</span><span class="p">);</span> <span class="c1">// <-- call to nf_nat_ipv4_fn</span>
<span class="cp">#ifdef CONFIG_XFRM
</span> <span class="k">if</span> <span class="p">(</span><span class="n">ret</span> <span class="o">!=</span> <span class="n">NF_ACCEPT</span><span class="p">)</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
</code></pre>
</div>
</div>
<div class="language-c highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">// net/netfilter/nf_nat_proto.c</span>
<span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">int</span>
<span class="nf">nf_nat_ipv4_fn</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">priv</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
<span class="k">const</span> <span class="k">struct</span> <span class="n">nf_hook_state</span> <span class="o">*</span><span class="n">state</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// ...</span>
<span class="k">return</span> <span class="n">nf_nat_inet_fn</span><span class="p">(</span><span class="n">priv</span><span class="p">,</span> <span class="n">skb</span><span class="p">,</span> <span class="n">state</span><span class="p">);</span>
<span class="p">}</span>
</code></pre>
</div>
</div>
<div class="language-c highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c1">// net/netfilter/nf_nat_core.c</span>
<span class="kt">unsigned</span> <span class="kt">int</span>
<span class="nf">nf_nat_inet_fn</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">priv</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
<span class="k">const</span> <span class="k">struct</span> <span class="n">nf_hook_state</span> <span class="o">*</span><span class="n">state</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// ...</span>
<span class="k">if</span> <span class="p">(</span><span class="n">nf_nat_oif_changed</span><span class="p">(</span><span class="n">state</span><span class="o">-></span><span class="n">hook</span><span class="p">,</span> <span class="n">ctinfo</span><span class="p">,</span> <span class="n">nat</span><span class="p">,</span> <span class="n">state</span><span class="o">-></span><span class="n">out</span><span class="p">))</span>
<span class="k">goto</span> <span class="n">oif_changed</span><span class="p">;</span>
<span class="c1">// ...</span>
<span class="nl">oif_changed:</span>
<span class="n">nf_ct_kill_acct</span><span class="p">(</span><span class="n">ct</span><span class="p">,</span> <span class="n">ctinfo</span><span class="p">,</span> <span class="n">skb</span><span class="p">);</span>
<span class="k">return</span> <span class="n">NF_DROP</span><span class="p">;</span>
<span class="p">}</span>
</code></pre>
</div>
</div>
<p>As far as function inlining goes, there’s only one way <code class="language-plaintext highlighter-rouge">nf_nat_inet_fn</code> calls into <code class="language-plaintext highlighter-rouge">nf_ct_delete</code>, which is through <code class="language-plaintext highlighter-rouge">nf_ct_kill_acct</code>. And the only reason for that is <code class="language-plaintext highlighter-rouge">nf_nat_oif_changed</code>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Now everything makes sense. With a badly placed <code class="language-plaintext highlighter-rouge">ACCEPT</code> rule, the conntrack entry gets a different mark saved than intended, and is then destroyed because subsequent packets, carrying the wrong mark, are routed differently. The timing of the related events also roughly matches the length of this code path. It also has to be a NAT’ed connection, as this path into <code class="language-plaintext highlighter-rouge">nf_ct_delete</code> is only reachable when the packet is about to be sent out the egress interface.</p>
]]></content><author><name>iBug</name></author><category term="linux" /><category term="networking" /><summary type="html"><![CDATA[I run a Debian-based software router on my home network. It’s connected to multiple ISPs, so I have some policy routing rules to balance the traffic between them. Some time ago, I noticed that the IPv4 connectivity got stuck intermittently when it didn’t use to, while IPv6 was working fine. It’s also interesting that the issue only happened with one specific ISP, in the egress direction, and only a few specific devices were affected.]]></summary></entry><entry><title type="html">Driving pppd with systemd</title><link href="https://ibug.io/blog/2024/07/pppd-with-systemd/" rel="alternate" type="text/html" title="Driving pppd with systemd" /><published>2024-07-07T00:00:00+00:00</published><updated>2024-07-16T01:25:55+00:00</updated><id>https://ibug.io/blog/2024/07/pppd-with-systemd</id><content type="html" xml:base="https://ibug.io/blog/2024/07/pppd-with-systemd/"><![CDATA[<p>I moved my soft router (Intel N5105, Debian) from school to home, and at home it’s behind an ONU on bridge mode, so it’ll have to do PPPoE itself.</p>
<p>Getting started with PPPoE on Debian is exactly the same as on Ubuntu: Install <code class="language-plaintext highlighter-rouge">pppoeconf</code> and run <code class="language-plaintext highlighter-rouge">pppoeconf</code>, then fill in the DSL username and password. Then I can see <code class="language-plaintext highlighter-rouge">ppp0</code> interface up and working.</p>
<p>However, as I use <code class="language-plaintext highlighter-rouge">systemd-networkd</code> on my router while <code class="language-plaintext highlighter-rouge">pppd</code> appears to be built around ifupdown, I’ll have to set up everything needed for <code class="language-plaintext highlighter-rouge">pppd</code> to work with systemd-networkd myself.</p>
<h2 id="systemd-service">Systemd service</h2>
<p>The first thing is to get it to start at boot. Looking through Google, a <a href="https://gist.github.com/rany2/330c8fe202b318cacdcb54830c20f98c">Gist</a> provides the exact systemd service file I need. After copying it to <code class="language-plaintext highlighter-rouge">/etc/systemd/system/pppd@.service</code>, I tried to start it with <code class="language-plaintext highlighter-rouge">systemctl start pppd@dsl-provider</code>. It seems like there’s a misconfiguration:</p>
<div class="language-text highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>/usr/sbin/pppd: Can't open options file /etc/ppp/peers/dsl/provider: No such file or directory
</code></pre>
</div>
</div>
<p>The instance name is surely <code class="language-plaintext highlighter-rouge">dsl-provider</code> and not <code class="language-plaintext highlighter-rouge">dsl/provider</code>, so I look more closely at the service file.</p>
<div class="language-ini highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="nn">[...]</span>
<span class="py">Description</span><span class="p">=</span><span class="s">PPP connection for %I</span>
<span class="nn">[...]</span>
<span class="py">ExecStart</span><span class="p">=</span><span class="s">/usr/sbin/pppd up_sdnotify nolog call %I</span>
</code></pre>
</div>
</div>
<p>The systemd man page <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html"><code class="language-plaintext highlighter-rouge">systemd.unit(5)</code></a> says:</p>
<blockquote>
<table>
<thead>
<tr>
<th>Specifier</th>
<th>Meaning</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>“%i”</td>
<td>Instance name</td>
<td>For instantiated units this is the string between the first “@” character and the type suffix. Empty for non-instantiated units.</td>
</tr>
<tr>
<td>“%I”</td>
<td>Unescaped instance name</td>
<td>Same as “%i”, but with escaping undone.</td>
</tr>
</tbody>
</table>
</blockquote>
<p>Fair enough: systemd escapes “/” as “-” in unit names, so <code class="language-plaintext highlighter-rouge">%I</code> (the unescaped instance name) turns <code class="language-plaintext highlighter-rouge">dsl-provider</code> back into <code class="language-plaintext highlighter-rouge">dsl/provider</code>. Let’s change <code class="language-plaintext highlighter-rouge">%I</code> to <code class="language-plaintext highlighter-rouge">%i</code> and try starting <code class="language-plaintext highlighter-rouge">pppd@dsl-provider</code> again.</p>
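<p>The relevant change is a single character in <code class="language-plaintext highlighter-rouge">ExecStart</code>. A minimal sketch, assuming the Gist was saved as the template unit <code class="language-plaintext highlighter-rouge">pppd@.service</code>:</p>
<div class="language-ini highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code># /etc/systemd/system/pppd@.service (excerpt)
[Service]
# %i keeps the instance name as-is ("dsl-provider");
# %I would unescape it into "dsl/provider"
ExecStart=/usr/sbin/pppd up_sdnotify nolog call %i
</code></pre>
</div>
</div>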
<h2 id="systemd-networkd">systemd-networkd</h2>
<p>Now that <code class="language-plaintext highlighter-rouge">ppp0</code> is up, time to configure routes and routing rules with <code class="language-plaintext highlighter-rouge">systemd-networkd</code>. I created a file <code class="language-plaintext highlighter-rouge">/etc/systemd/network/10-ppp0.network</code>.</p>
<div class="language-ini highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="nn">[Match]</span>
<span class="py">Name</span><span class="p">=</span><span class="s">ppp0</span>
<span class="nn">[Network]</span>
<span class="py">DHCP</span><span class="p">=</span><span class="s">yes</span>
<span class="c"># ...
</span></code></pre>
</div>
</div>
<p>After restarting systemd-networkd, I was disappointed to see the PPP-negotiated IP address removed, leaving only an SLAAC IPv6 address behind. Some searching through <code class="language-plaintext highlighter-rouge">systemd.network(5)</code> turned up <code class="language-plaintext highlighter-rouge">KeepConfiguration=yes</code>, which was exactly what I was looking for.</p>
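<p>With that, <code class="language-plaintext highlighter-rouge">10-ppp0.network</code> becomes something like this (a sketch; the <code class="language-plaintext highlighter-rouge">KeepConfiguration=</code> line is the only change):</p>
<div class="language-ini highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>[Match]
Name=ppp0
[Network]
DHCP=yes
# Keep the address and routes that pppd itself installed,
# instead of flushing them when systemd-networkd (re)configures the link
KeepConfiguration=yes
# ...
</code></pre>
</div>
</div>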
<h2 id="start-order">Start order</h2>
<p>One problem still remains: At the time systemd-networkd starts, <code class="language-plaintext highlighter-rouge">ppp0</code> is not yet up, and systemd-networkd simply skips its configuration. A solution seems trivial:</p>
<div class="language-ini highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c"># systemctl edit pppd@dsl-provider
</span><span class="nn">[Unit]</span>
<span class="py">Before</span><span class="p">=</span><span class="s">systemd-networkd.service</span>
</code></pre>
</div>
</div>
<p>… except it doesn’t seem to have any effect.</p>
<p>I didn’t want to dig into pppd itself, so I looked around for something analogous to ifupdown’s <code class="language-plaintext highlighter-rouge">up</code> scripts and found <code class="language-plaintext highlighter-rouge">/etc/ppp/ip-up.d/</code>. I could simply drop a script there to notify systemd-networkd.</p>
<div class="language-shell highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c"># /etc/ppp/ip-up.d/1systemd-networkd</span>
<span class="c">#!/bin/sh</span>
networkctl reconfigure <span class="s2">"</span><span class="nv">$PPP_IFACE</span><span class="s2">"</span>
</code></pre>
</div>
</div>
<p>I also noticed that, with ifupdown installed, the config that <code class="language-plaintext highlighter-rouge">pppoeconf</code> creates looks like this:</p>
<div class="language-shell highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>auto dsl-provider
iface dsl-provider inet ppp
pre-up /bin/ip <span class="nb">link set </span>enp3s0 up <span class="c"># line maintained by pppoeconf</span>
provider dsl-provider
</code></pre>
</div>
</div>
<p>So to maintain behavioral compatibility, I configured the systemd service like this:</p>
<div class="language-ini highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="c"># systemctl edit pppd@dsl-provider
</span><span class="nn">[Unit]</span>
<span class="py">BindsTo</span><span class="p">=</span><span class="s">sys-subsystem-net-devices-enp3s0.device</span>
<span class="py">After</span><span class="p">=</span><span class="s">sys-subsystem-net-devices-enp3s0.device</span>
</code></pre>
</div>
</div>
<p>After multiple reboots and manual restarts of <code class="language-plaintext highlighter-rouge">pppd@dsl-provider.service</code>, I’m convinced that this is a reliable solution.</p>
<h2 id="extra">Extra: IPv6 PD</h2>
<p>As the home ISP provides IPv6 Prefix Delegation (but my school’s ISP didn’t), it would be nice to take the delegated prefix and distribute it to the LAN. Online tutorials are abundant, e.g. <a href="https://major.io/p/dhcpv6-prefix-delegation-with-systemd-networkd/" rel="nofollow noopener">this one</a>. With everything supposedly set up, I was again disappointed to see only a single SLAAC IPv6 address on <code class="language-plaintext highlighter-rouge">ppp0</code> itself, and <code class="language-plaintext highlighter-rouge">journalctl -eu systemd-networkd</code> showed no sign of a PD allocation being received.</p>
<p>After poking around with <code class="language-plaintext highlighter-rouge">IPv6AcceptRA=</code> and <code class="language-plaintext highlighter-rouge">[DHCPv6] PrefixDelegationHint=</code> settings for a while, I decided to capture some packets for investigation. I started <code class="language-plaintext highlighter-rouge">tcpdump -i ppp0 -w /tmp/ppp0.pcap icmp6 or udp port 546</code> and restarted <code class="language-plaintext highlighter-rouge">systemd-networkd</code>. After a few seconds, the pcap file contained exactly the 4 packets I needed (some fields omitted for brevity):</p>
<div class="language-markdown highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="p">-</span> ICMPv6: Router Solicitation from 00:00:00:00:00:00
<span class="p">-</span> ICMPv6: Router Advertisement from 00:00:5e:00:01:99
<span class="p"> -</span> Flags: 0x40 (only O)
<span class="p"> -</span> ICMPv6 Option: Prefix information (2001:db8::/64)
<span class="p"> -</span> Flags: L + A
<span class="p">-</span> DHCPv6: Information-request XID: 0x8bf4f0 CID: 00020000ab11503f79e54f10745d
<span class="p"> -</span> Option Request
<span class="p"> -</span> Option: Option Request (6)
<span class="p"> -</span> Length: 10
<span class="p"> -</span> Requested Option code: DNS recursive name server (23)
<span class="p"> -</span> Requested Option code: Simple Network Time Protocol Server (31)
<span class="p"> -</span> Requested Option code: Lifetime (32)
<span class="p"> -</span> Requested Option code: NTP Server (56)
<span class="p"> -</span> Requested Option code: INF_MAX_RT (83)
<span class="p">-</span> DHCPv6: Reply XID: 0x8bf4f0 CID: 00020000ab11503f79e54f10745d
</code></pre>
</div>
</div>
<p>Clearly, even with <code class="language-plaintext highlighter-rouge">PrefixDelegationHint=</code> set, the client isn’t requesting a PD allocation at all. After some more Googling, I added <code class="language-plaintext highlighter-rouge">[DHCPv6] WithoutRA=solicit</code> to <code class="language-plaintext highlighter-rouge">10-ppp0.network</code> and restarted <code class="language-plaintext highlighter-rouge">systemd-networkd</code>. Now there are 6 packets, but the order appears a little off:</p>
<div class="language-markdown highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code><span class="p">-</span> Solicit XID: 0x2bc2aa CID: 00020000ab11503f79e54f10745d
<span class="p">-</span> Advertise XID: 0x2bc2aa CID: 00020000ab11503f79e54f10745d
<span class="p">-</span> Request XID: 0xf8c1dd CID: 00020000ab11503f79e54f10745d
<span class="p"> -</span> Identity Association for Prefix Delegation
<span class="p">-</span> Reply XID: 0xf8c1dd CID: 00020000ab11503f79e54f10745d
<span class="p">-</span> Router Solicitation from 00:00:00:00:00:00
<span class="p">-</span> Router Advertisement from 00:00:5e:00:01:99
</code></pre>
</div>
</div>
<p>This time the DHCPv6 exchange comes <em>before</em> the RS/RA pair, which is not what I expected, but at least the client is now requesting a PD prefix.</p>
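<p>For reference, the DHCPv6 part of <code class="language-plaintext highlighter-rouge">10-ppp0.network</code> at this point looks roughly like this (a sketch; the <code class="language-plaintext highlighter-rouge">PrefixDelegationHint=</code> line is optional and its value only an example):</p>
<div class="language-ini highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>[DHCPv6]
# Run the DHCPv6 client in solicit mode even without waiting for an RA
WithoutRA=solicit
# Hint at the prefix size we'd like delegated (example value)
PrefixDelegationHint=::/60
</code></pre>
</div>
</div>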
<p>Then I found <a href="https://unix.stackexchange.com/a/715025/211239">this answer</a>, which gets straight to the point. Summarized:</p>
<ul>
<li>The “managed” (M) flag indicates the client should acquire an address via DHCPv6, and triggers DHCPv6 Solicit and Request messages.</li>
<li>The “other” (O) flag indicates the client should do SLAAC while acquiring other configuration information via DHCPv6, and triggers DHCPv6 Information-request messages.</li>
<li>When both flags are present, the O flag is superseded by the M flag and has no effect.</li>
</ul>
<p>So systemd-networkd is implementing everything correctly, and I should instead configure it to always send Solicit messages regardless of the RA flags received. This is done by setting <code class="language-plaintext highlighter-rouge">[IPv6AcceptRA] DHCPv6Client=always</code>.</p>
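<p>In config terms, that is one more section in <code class="language-plaintext highlighter-rouge">10-ppp0.network</code> (a sketch):</p>
<div class="language-ini highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>[IPv6AcceptRA]
# "always": start the DHCPv6 client in managed (solicit) mode
# even when the RA only sets the O flag
DHCPv6Client=always
</code></pre>
</div>
</div>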
<p>Now with every detail understood, after a restart of <code class="language-plaintext highlighter-rouge">systemd-networkd</code>, I finally see the PD prefix allocated:</p>
<div class="language-text highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>systemd-networkd[528]: ppp0: DHCP: received delegated prefix 2001:db8:0:a00::/60
systemd-networkd[528]: enp1s0: DHCP-PD address 2001:db8:0:a00:2a0:c9ff:feee:c4b/64 (valid for 2d 23h 59min 59s, preferred for 1d 23h 59min 59s)
systemd-networkd[528]: enp2s0: DHCP-PD address 2001:db8:0:a01:2a0:c9ff:feee:c4c/64 (valid for 2d 23h 59min 59s, preferred for 1d 23h 59min 59s)
</code></pre>
</div>
</div>
<h2 id="update-1">Update: Stuck booting</h2>
<p>A few days after this blog post, my local ISP ran into an outage that rendered the PPPoE connection unusable.
Since I couldn’t identify the issue at first, I tried rebooting the router, and it never came back up.