-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fixed chunk size #173
fixed chunk size #173
Conversation
We have a fair amount of complexity (EqualSplit etc) involved in dynamically splitting files down into chunks based on available concurrency. We could be less clever and use a fixed chunk size, with a fixed buffer size to go with it. This would dovetail nicely with a sync.Pool of fixed-size buffers, and also with a configuration option to limit the total allocated RAM (implemented as a semaphore guarding how many buffers can be fetched from the sync.Pool). This commit only does consistent hashing mode; we should do buffer mode as well if we do this work.
side note: |
For const and var shared between buffer and consistent-hash mode move these constants to a dedicated .go file. These are two distinct downloaders and neither should be responsible for explicitly shared elements.
Wire up the --chunk-size option and deprecate --minimum-chunk-size.
The concurrency calculation is no longer used in consistent-hashing. Remove it.
Update Buffer downloader to use ChunkSize instead of MinChunkSize. Options have been modified to only contain ChunkSize now as the root and multifile commands no longer reference MinChunkSize. This change is in support of moving towards simplifying the chunksize logic. Additionally this will make reuse of buffers significantly easier as the next phase.
If chunkSize doesn't divide contentSize evenly, and chunkSize < 0.5*contentSize, we try to download a chunk bigger than chunkSize. This should not be possible.
3254d15
to
31bc26d
Compare
Every chunk except the last is now the same chunkSize, which is the intention of the code. We no longer dynamically change chunk size. Also, simplify the consistent hashing code to be more similar to buffer mode, and remove the dead EqualSplit code.
cmd/root/root.go
Outdated
@@ -199,7 +202,7 @@ func runRootCMD(cmd *cobra.Command, args []string) error { | |||
// rootExecute is the main function of the program and encapsulates the general logic | |||
// returns any/all errors to the caller. | |||
func rootExecute(ctx context.Context, urlString, dest string) error { | |||
minChunkSize, err := humanize.ParseBytes(viper.GetString(config.OptMinimumChunkSize)) | |||
chunkSize, err := humanize.ParseBytes(viper.GetString(config.OptMinimumChunkSize)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chunkSize, err := humanize.ParseBytes(viper.GetString(config.OptMinimumChunkSize)) | |
chunkSize, err := humanize.ParseBytes(viper.GetString(config.OptChunkSize)) |
We should still check MinimumChunkSize and adjust ChunkSize upwards to satisfy it if it's set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not clear why we are bothering to keep min chunk size. It seems... odd? Let's just specify the chunk size we want. Having two controls for the same thing is how we got into the mess of too many ways to control concurrency.
If someone wants a chunk size of 1byte... it's silly but it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because pget is already deployed to many models which set it
if they set min chunk size to 500MiB we should respect that and not have a 125MiB chunk size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had wired up minchunk size and chunk size to the same value behind the scenes so they should be able to be set either flag but with --minchunk size deprecated.
The exception is in prod where our value would continue to override (since we set with the config map).
The only caveat is if you specified both, the last one processed wins.
I see that I may not have been clear about that change. Totally on me and I'll be better call those things out in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yes I missed that. However:
- we should not refer to the deprecated name on this line, so my suggestion above still stands
- we wire both args up to the
chunkSize
variable but we never read that variable so it's not obvious to me that it matters (instead we fetch with viper). I don't have a good mental model for what happens? I'm not sure why we bind a variable if we're never reading it? - pflag has docs on how to alias flag names, maybe we should do that instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aliasing is a bit weird with viper involved. Generally using an XXXVarP type instead of XXXP in cobra is the easiest way to alias as it backs to a pointer. I'm really ok either way. Your suggestions do absolutely apply in either case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we have viper, again? is it so we can configure via environment variable? it's all a bit magic, I don't understand it, and i don't find the documentation helpful in the detail (like this variable aliasing thing, I don't think the documentation defines the behaviour here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Viper is for the ENV binding to the flags. It's a lot of magic
Variable aliasing results in the last processed element winning i'll mkae sure the arguments cannot be both used. It should address the concerns. The ENV bits is a bit more work I'll make sure it's also clear (or at least documented) behavior
Aliases in Cobra and Viper are relatively complex and viper.RegisterAlias does not do what would be expected as ENV handling is a very special case. This commit fixes up the handling of aliases specifically for chunk size and minimuim chunk size. * CLI Option Set * NEW ENV VAR * Deprecated ENV Var * Default Value For ChunkSize and MinimumChunkSize we now explicitly disallow setting both --chunk-size and --minimum-chunk-size as CLI options. If PGET_CHUNK_SIZE is set, we utilize that value except if the cli option(s) are set. If PGET_MINIMUM_CHUNK_SIZE is set * If PGET_CHUNK_SIZE is set to a non-default value a warning is emitted, and PGET_CHUNK_SIZE is used.. * If PGET_CHUNK_SIZE is the default value, PGET_MINIMUM_CHUNK_SIZE is set as PGET_CHUNK_SIZE and a deprecation warning is emitted. Additionally, viper calls now all reference config.OptChunkSize
716bf17
to
e71b6b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving at this point, however, @philandstuff I'd like you to also take another look at this before we land it.
cmd/root/root.go
Outdated
changedChunk := cmd.PersistentFlags().Changed(config.OptChunkSize) | ||
if changedMin && changedChunk { | ||
return fmt.Errorf("--minimum-chunk-size and --chunk-size cannot be used at the same time, use --chunk-size instead") | ||
} else if !(changedMin && changedChunk) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else if !(changedMin && changedChunk) { | |
} else { |
in this house, we accept the principal of the excluded middle, so this condition is unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact, because the if
block returns, we don't need an else
at all
Simplify conditional. Also, if both PGET_MINIMUM_CHUNK_SIZE and PGET_CHUNK_SIZE are set to the same value, don't emit a warning.
We have a fair amount of complexity (EqualSplit etc) involved in dynamically splitting files down into chunks based on available concurrency.
We could be less clever and use a fixed chunk size, with a fixed buffer size to go with it. This would dovetail nicely with a sync.Pool of fixed-size buffers, and also with a configuration option to limit the total allocated RAM (implemented as a semaphore guarding how many buffers can be fetched from the sync.Pool).
This commit only does consistent hashing mode; we should do buffer mode as well if we do this work.