Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixed chunk size #173

Merged
merged 13 commits into from
Mar 6, 2024
Merged

fixed chunk size #173

merged 13 commits into from
Mar 6, 2024

Conversation

philandstuff
Copy link
Contributor

We have a fair amount of complexity (EqualSplit etc) involved in dynamically splitting files down into chunks based on available concurrency.

We could be less clever and use a fixed chunk size, with a fixed buffer size to go with it. This would dovetail nicely with a sync.Pool of fixed-size buffers, and also with a configuration option to limit the total allocated RAM (implemented as a semaphore guarding how many buffers can be fetched from the sync.Pool).

This commit only does consistent hashing mode; we should do buffer mode as well if we do this work.

We have a fair amount of complexity (EqualSplit etc) involved in dynamically
splitting files down into chunks based on available concurrency.

We could be less clever and use a fixed chunk size, with a fixed buffer size to
go with it.  This would dovetail nicely with a sync.Pool of fixed-size buffers,
and also with a configuration option to limit the total allocated
RAM (implemented as a semaphore guarding how many buffers can be fetched from
the sync.Pool).

This commit only does consistent hashing mode; we should do buffer mode as well
if we do this work.
@philandstuff philandstuff requested a review from a team March 4, 2024 17:59
@dkhokhlov
Copy link

side note:
https://github.com/replicate/pget/pull/173/files#diff-ec1a0e746b82d63c33a617e0b8dd205d92ac3cec2213443a6e29af27aaa37f28R139-R145
checks for one var (firstReqResult.err) but uses another var (err) in diag.

For const and var shared between buffer and consistent-hash mode move these
constants to a dedicated .go file. These are two distinct downloaders
and neither should be responsible for explicitly shared elements.
Wire up the --chunk-size option and deprecate --minimum-chunk-size.
The concurrency calculation is no longer used in consistent-hashing.
Remove it.
Update Buffer downloader to use ChunkSize instead of MinChunkSize.
Options have been modified to only contain ChunkSize now as the root and
multifile commands no longer reference MinChunkSize.

This change is in support of moving towards simplifying the chunksize
logic. Additionally this will make reuse of buffers significantly easier
as the next phase.
pkg/download/buffer.go Outdated Show resolved Hide resolved
If chunkSize doesn't divide contentSize evenly, and chunkSize < 0.5*contentSize,
we try to download a chunk bigger than chunkSize.  This should not be possible.
Every chunk except the last is now the same chunkSize, which is the intention of
the code.  We no longer dynamically change chunk size.

Also, simplify the consistent hashing code to be more similar to buffer mode,
and remove the dead EqualSplit code.
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
cmd/root/root.go Outdated
@@ -199,7 +202,7 @@ func runRootCMD(cmd *cobra.Command, args []string) error {
// rootExecute is the main function of the program and encapsulates the general logic
// returns any/all errors to the caller.
func rootExecute(ctx context.Context, urlString, dest string) error {
minChunkSize, err := humanize.ParseBytes(viper.GetString(config.OptMinimumChunkSize))
chunkSize, err := humanize.ParseBytes(viper.GetString(config.OptMinimumChunkSize))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
chunkSize, err := humanize.ParseBytes(viper.GetString(config.OptMinimumChunkSize))
chunkSize, err := humanize.ParseBytes(viper.GetString(config.OptChunkSize))

We should still check MinimumChunkSize and adjust ChunkSize upwards to satisfy it if it's set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear why we are bothering to keep min chunk size. It seems... odd? Let's just specify the chunk size we want. Having two controls for the same thing is how we got into the mess of too many ways to control concurrency.

If someone wants a chunk size of 1byte... it's silly but it's fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because pget is already deployed to many models which set it

if they set min chunk size to 500MiB we should respect that and not have a 125MiB chunk size.

Copy link
Member

@tempusfrangit tempusfrangit Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had wired up minchunk size and chunk size to the same value behind the scenes so they should be able to be set either flag but with --minchunk size deprecated.

The exception is in prod where our value would continue to override (since we set with the config map).

The only caveat is if you specified both, the last one processed wins.

I see that I may not have been clear about that change. Totally on me and I'll be better call those things out in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yes I missed that. However:

  • we should not refer to the deprecated name on this line, so my suggestion above still stands
  • we wire both args up to the chunkSize variable but we never read that variable so it's not obvious to me that it matters (instead we fetch with viper). I don't have a good mental model for what happens? I'm not sure why we bind a variable if we're never reading it?
  • pflag has docs on how to alias flag names, maybe we should do that instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aliasing is a bit weird with viper involved. Generally using an XXXVarP type instead of XXXP in cobra is the easiest way to alias as it backs to a pointer. I'm really ok either way. Your suggestions do absolutely apply in either case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have viper, again? is it so we can configure via environment variable? it's all a bit magic, I don't understand it, and i don't find the documentation helpful in the detail (like this variable aliasing thing, I don't think the documentation defines the behaviour here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Viper is for the ENV binding to the flags. It's a lot of magic

Variable aliasing results in the last processed element winning i'll mkae sure the arguments cannot be both used. It should address the concerns. The ENV bits is a bit more work I'll make sure it's also clear (or at least documented) behavior

pkg/download/options.go Outdated Show resolved Hide resolved
philandstuff and others added 5 commits March 5, 2024 12:18
Aliases in Cobra and Viper are relatively complex and
viper.RegisterAlias does not do what would be expected as ENV handling
is a very special case. This commit fixes up the handling of aliases
specifically for chunk size and minimuim chunk size.

* CLI Option Set
* NEW ENV VAR
* Deprecated ENV Var
* Default Value

For ChunkSize and MinimumChunkSize we now explicitly disallow setting
both --chunk-size and --minimum-chunk-size as CLI options.

If PGET_CHUNK_SIZE is set, we utilize that value except if the cli
option(s) are set.

If PGET_MINIMUM_CHUNK_SIZE is set
* If PGET_CHUNK_SIZE is set to a non-default value a warning is emitted,
  and PGET_CHUNK_SIZE is used..

* If PGET_CHUNK_SIZE is the default value, PGET_MINIMUM_CHUNK_SIZE is
  set as PGET_CHUNK_SIZE and a deprecation warning is emitted.

Additionally, viper calls now all reference config.OptChunkSize
@tempusfrangit tempusfrangit changed the title WIP: fixed chunk size fixed chunk size Mar 5, 2024
Copy link
Member

@tempusfrangit tempusfrangit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving at this point, however, @philandstuff I'd like you to also take another look at this before we land it.

cmd/root/root.go Outdated
changedChunk := cmd.PersistentFlags().Changed(config.OptChunkSize)
if changedMin && changedChunk {
return fmt.Errorf("--minimum-chunk-size and --chunk-size cannot be used at the same time, use --chunk-size instead")
} else if !(changedMin && changedChunk) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if !(changedMin && changedChunk) {
} else {

in this house, we accept the principal of the excluded middle, so this condition is unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact, because the if block returns, we don't need an else at all

Simplify conditional.

Also, if both PGET_MINIMUM_CHUNK_SIZE and PGET_CHUNK_SIZE are set to the same
value, don't emit a warning.
@philandstuff philandstuff merged commit 58cccd0 into main Mar 6, 2024
5 checks passed
@philandstuff philandstuff deleted the fixed-chunk-size branch March 6, 2024 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants