[Discussion] Maybe we should use dynamic shared memory by default #138

Closed
LeiWang1999 opened this issue Aug 9, 2024 · 2 comments

Comments

@LeiWang1999
Contributor

Previously, we observed that two schedules built from the same hint but with different shared memory scopes (shared.dyn and shared) performed differently: shared.dyn consistently underperformed shared. As a result, our design has favored static shared memory. However, the fix introduced in this commit resolved the issue by eliminating 20% of the redundant sync primitives in shared.dyn, so the two scopes should now perform comparably.

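Given this improvement, I suggest we switch the default shared memory scope to shared.dyn so that we can explore more tile candidates: dynamic shared memory is not bound by the 48 KB per-block cap that CUDA places on statically declared shared memory, so larger tiles become feasible. We should still benchmark the change to make sure it does not hurt performance.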
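To illustrate why the scope affects the number of tile candidates, here is a minimal sketch. It assumes a simplified footprint model (one float16 A tile and one int4 B tile per pipeline stage), which is not necessarily how the project computes shared memory usage. The function names, the tile sizes, and the 99 KB dynamic limit are hypothetical values chosen for illustration. The real opt-in limit depends on the device.

```python
from itertools import product

# CUDA cap on statically declared shared memory per block.
STATIC_LIMIT_BYTES = 48 * 1024
# Assumed opt-in limit for dynamic shared memory; device dependent.
DYNAMIC_LIMIT_BYTES = 99 * 1024


def smem_bytes(block_m, block_n, block_k, stages=2, a_bits=16, b_bits=4):
    """Hypothetical footprint: an A tile (block_m x block_k) and a
    B tile (block_n x block_k) are buffered once per pipeline stage."""
    a_tile = block_m * block_k * a_bits // 8
    b_tile = block_n * block_k * b_bits // 8
    return stages * (a_tile + b_tile)


def feasible_tiles(limit_bytes, stages=2):
    """Enumerate (block_m, block_n, block_k) tiles that fit in limit_bytes."""
    sizes = (64, 128, 256)
    return [
        (m, n, k)
        for m, n, k in product(sizes, sizes, (32, 64, 128))
        if smem_bytes(m, n, k, stages) <= limit_bytes
    ]


static_tiles = feasible_tiles(STATIC_LIMIT_BYTES)
dynamic_tiles = feasible_tiles(DYNAMIC_LIMIT_BYTES)
# Tiles that only fit once the 48 KB static cap no longer applies.
extra = sorted(set(dynamic_tiles) - set(static_tiles))

print(f"static: {len(static_tiles)} candidates, dynamic: {len(dynamic_tiles)} candidates")
print("only reachable with shared.dyn:", extra)
```

Under this model, every tile that fits in static shared memory also fits in dynamic shared memory, so the switch can only widen the search space. Whether the extra candidates are actually faster still has to be measured.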

LeiWang1999 added the "good first issue" (Good for newcomers) label on Aug 9, 2024
LeiWang1999 self-assigned this on Aug 9, 2024
@LeiWang1999
Contributor Author

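Here are the benchmark results for a randomly selected set of benchmarks (Difference is static latency minus dynamic latency, so a negative value means shared.dyn was slower). I think the results are acceptable.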

Input arguments: (1, 16384, 16384, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.08703966666666665, Dynamic latency: 0.087381, Difference: -0.00034133333333334626
Input arguments: (16, 16384, 16384, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.37512500000000004, Dynamic latency: 0.37512500000000004, Difference: 0.0
Input arguments: (32, 16384, 16384, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.6386346666666667, Dynamic latency: 0.6362450000000001, Difference: 0.0023896666666666233
Input arguments: (64, 16384, 16384, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.944128, Dynamic latency: 0.9458343333333332, Difference: -0.0017063333333332542
Input arguments: (128, 16384, 16384, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 1.5605756666666668, Dynamic latency: 1.5616, Difference: -0.0010243333333332938
Input arguments: (256, 16384, 16384, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 2.6047146666666667, Dynamic latency: 2.6043730000000003, Difference: 0.0003416666666664625
Input arguments: (1024, 16384, 16384, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 9.431722333333333, Dynamic latency: 9.434453000000001, Difference: -0.0027306666666682133
Input arguments: (16, 43008, 14336, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.641365, Dynamic latency: 0.6410236666666667, Difference: 0.0003413333333333046
Input arguments: (32, 14336, 14336, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.418816, Dynamic latency: 0.43144499999999997, Difference: -0.012628999999999946
Input arguments: (64, 57344, 14336, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 1.8684583333333333, Dynamic latency: 1.8688, Difference: -0.00034166666666668455
Input arguments: (128, 14336, 57344, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 3.626666666666667, Dynamic latency: 3.6242773333333336, Difference: 0.0023893333333333544
Input arguments: (256, 9216, 9216, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.5058556666666667, Dynamic latency: 0.5065386666666667, Difference: -0.0006829999999999892
Input arguments: (128, 36864, 9216, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.9878183333333335, Dynamic latency: 0.987477, Difference: 0.00034133333333341564
Input arguments: (64, 9216, 36864, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.8321703333333332, Dynamic latency: 0.830805, Difference: 0.0013653333333332185
Input arguments: (32, 22016, 8192, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.37785599999999997, Dynamic latency: 0.37785599999999997, Difference: 0.0
Input arguments: (16, 8192, 22016, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.3908263333333334, Dynamic latency: 0.3904849999999999, Difference: 0.00034133333333347116
Input arguments: (32, 8192, 8192, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.19865600000000005, Dynamic latency: 0.19797299999999998, Difference: 0.0006830000000000724
Input arguments: (64, 28672, 8192, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.5645653333333335, Dynamic latency: 0.5645653333333335, Difference: 0.0
Input arguments: (128, 8192, 28672, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 1.6230399999999998, Dynamic latency: 1.6240636666666666, Difference: -0.001023666666666756

Let's do it.
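The absolute differences above are easier to judge as percentages of the static latency. The following is a minimal sketch of that calculation, not the script that produced the log. The regular expression and the helper name `parse_log` are assumptions, and only three of the posted lines are embedded as sample input.

```python
import re

# Three lines copied verbatim from the benchmark log above.
LOG = """\
Input arguments: (1, 16384, 16384, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.08703966666666665, Dynamic latency: 0.087381, Difference: -0.00034133333333334626
Input arguments: (32, 14336, 14336, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.418816, Dynamic latency: 0.43144499999999997, Difference: -0.012628999999999946
Input arguments: (32, 8192, 8192, 'float16', 'int4', 'float16', 'float16', 'nt', False, None, False, False, None), Static latency: 0.19865600000000005, Dynamic latency: 0.19797299999999998, Difference: 0.0006830000000000724
"""

LINE_RE = re.compile(
    r"Input arguments: \((\d+), (\d+), (\d+),.*?"
    r"Static latency: ([\d.]+), Dynamic latency: ([\d.]+)"
)


def parse_log(text):
    """Return one dict per log line with the first three arguments and both latencies."""
    rows = []
    for match in LINE_RE.finditer(text):
        a, b, c = (int(g) for g in match.group(1, 2, 3))
        static, dynamic = float(match.group(4)), float(match.group(5))
        rows.append({
            "shape": (a, b, c),
            "static": static,
            "dynamic": dynamic,
            # Positive means shared.dyn is slower than shared.
            "rel_pct": (dynamic - static) / static * 100.0,
        })
    return rows


rows = parse_log(LOG)
worst = max(rows, key=lambda r: abs(r["rel_pct"]))
for r in rows:
    print(f"{r['shape']}: {r['rel_pct']:+.2f}% (dynamic vs. static)")
print("largest gap:", worst["shape"])
```

Applied to the full log, the largest gap is the (32, 14336, 14336) case, where shared.dyn is about 3% slower. Every other case is within roughly 0.4% in either direction.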

@LeiWang1999
Contributor Author

This has been merged in PR #133.
