Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable use of cloud reference files #1804

Merged
merged 7 commits into from
Feb 5, 2024
Merged

Conversation

cmnbroad
Copy link
Contributor

@cmnbroad cmnbroad commented May 2, 2022

...by updating the ReferenceArgumentCollection implementations to use nio paths.

(Edited by tsato on 8/14/23)

This PR updates ReferenceArgumentCollection and classes that implement it to support reference files stored in the cloud. Now DownsampleSam and CreateSequenceDictionary support cloud reference files. Updating of the remaining tools will be done in a separate PR.

Legacy tools may continue to access the reference via File REFERENCE_SEQUENCE in CommandLineProgram or by getReferenceFile() in ReferenceArgumentCollection.

Going forward the tools should access the reference via these methods in ReferenceArgumentCollection:

  • Path getReferencePath() : access the reference as Path while avoiding NPE in case it's null.
  • PicardHtsPath getHtsPath() : the main way to access the reference

@cmnbroad cmnbroad changed the title Enable the use of cloud reference files Enable use of cloud reference files May 2, 2022
@cmnbroad
Copy link
Contributor Author

cmnbroad commented May 2, 2022

@markjschreiber This is the branch I mentioned in email - would be useful to know if this works with AWS.

@markjschreiber
Copy link
Contributor

This would work for S3. Any reason for using HtsPath rather than PicardHtsPath? I think the rationale for creating PicardHtsPath as an extension of HtsPath was that it would allow the possibility in the future of adding methods to configure Google Storage clients or other cloud storage clients that required special configuration.

@cmnbroad
Copy link
Contributor Author

cmnbroad commented May 2, 2022

Oh yeah - thanks for catching that - we definitely should use PicardHtsPath - I'm just trying to do too many things in parallel today. I won't be able to finish a full PR today, but I'm testing getReferencePath() on a hacked Picard tool on GCS now.

@markjschreiber
Copy link
Contributor

If this works then I think we should deprecate direct access of REFERENCE_PATH in CommandLineProgram as well as SinglePassSamProgram favoring the use of getReferencePath() instead so that we have a path forward (if you'll forgive the pun).

@cmnbroad
Copy link
Contributor Author

cmnbroad commented May 3, 2022

Naturally, this was a little more complicated than I first thought, but I think it's correct now. It will allow us to incrementally migrate tools to support cloud references without introducing any compatibility issues. Specifically, it retains and populates the REFERENCE_SEQUENCE variable of type File in CommandLineProgram for tools that depend on that. Those tools can't use cloud reference files now, and if presented with one with theses changes will fail in the same way that they do now. To migrate a tool, it needs to be changed from using REFERENCE_SEQUENCE or referenceArgumentCollection.getReferenceFile(), to instead use referenceArgumentCollection.getReferencePath() or referenceArgumentCollection.getHtsPath(). I've tested this manually using using CleanSam, which I've migrated in this PR (for references anyway) with a reference on GCS. Keeping this as draft until all tests pass.

@cmnbroad cmnbroad force-pushed the cn_nio_reference_files branch 2 times, most recently from 628a84a to 0d3398e Compare May 3, 2022 14:48
@cmnbroad cmnbroad marked this pull request as ready for review May 9, 2022 14:52
@gbggrant gbggrant requested a review from yfarjoun May 25, 2022 17:54
Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good apart from the lack of a test with a reference on the cloud -- we'd need that before this can go in

@cmnbroad cmnbroad marked this pull request as draft October 11, 2022 17:21
@cmnbroad
Copy link
Contributor Author

@takutosato This PR is currently draft because it needs a cloud test. We should resurrect it when you start on the Picard cloud work, and add a test once we have a way to do that.

return null;
} else if (picardPath.getScheme().equals(PicardHtsPath.FILE_SCHEME)) {
// file on a local file system
return picardPath == null ? null : picardPath.toPath().toFile();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return picardPath.toPath().toFile() because picardPath is never null if the code gets here

@takutosato takutosato self-assigned this Jul 19, 2023
@takutosato takutosato marked this pull request as ready for review August 14, 2023 18:21
@takutosato
Copy link
Contributor

@droazen and @cmnbroad this PR is ready for review. Will submit the accompanying PR in htsjdk soon too.

Copy link
Contributor Author

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tsato. I started reviewing, but I think the previous PR (the one that included GATKBucketUtils and GATKIOUtils) propagates some code from GATK that will cause us to go down a path where we reproduce a lot of the patterns from GATK that we're trying to eradicate. I put a couple of comments here inline - take a look at them and see if they make sense to you, and maybe we can meet to discuss if you want.

But my basic uber-comment is to try to eliminate any code that attempts to manipulate/modify/interrogate a PicardHtsPath by converting it to a String, and to only convert to File, or String if you absolutely have to.

So for instance, the methods in GATKBucketUtils that accept or return a String would ideally be replaced with versions that manipulate a PicardHtsPath. Of course there may be cases where we have no choice, but I think most of the code here can be modified to eliminate these code patterns.

@@ -43,6 +43,9 @@ private GATKBucketUtils(){} //private so that no one will instantiate this class
*/
public static String getTempFilePath(String prefix, String extension){
if (isGcsUrl(prefix) || (isHadoopUrl(prefix))){
if (!extension.startsWith(".")) {
extension = "." + extension;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big fan of modifying extension here - I think it would be preferable to have this method use the extension as presented, and document that behavior, and require the call sites specify the "." if thats what they want (we haven't been consistent about this in the past).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why does this file have a GATK prefixes in the name ?

Copy link
Contributor

@takutosato takutosato Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here was that the method was behaving inconsistently when the input is a local file path (always return .txt, modifying the extension where necessary) vs a gs:// file path (do not modify the extension, which resulted in temp file path like "filetxt"). I like your approach too; will change it for both cases.

final String cloudPath = GATKBucketUtils.getTempFilePath(GCloudTestUtils.getTestInputPath(), "txt");
final String localPath = GATKBucketUtils.getTempFilePath("test", "txt");

int d = 3;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused.

validateSamFile.INPUT = downsampled;
Assert.assertEquals(validateSamFile.doWork(), 0);
// temporarily skip this check for cram files until ValidateSam is updated the support cloud reference input
if (! downsampled.getRawInputString().endsWith(".cram")){
Copy link
Contributor Author

@cmnbroad cmnbroad Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (! downsampled.getRawInputString().endsWith(".cram")){
if (! downsampled.hasExtension(FileExtensions.CRAM)){

Converting these paths (like downSampled) to String or File is exactly the code pattern we want to eradicate. In an ideal world everything would be done in PicardHTSPath space, and we would only just-in-time convert to String or Path, or in the worst case File, if its absolutely necessary because we're calling some external API that requires it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -181,30 +184,40 @@ private Iterable<SAMSequenceRecord> getSamSequenceRecordsIterable() {
*/
protected String[] customCommandLineValidation() {
if (URI == null) {
URI = "file:" + referenceSequence.getReferenceFile().getAbsolutePath();
if (GATKBucketUtils.isRemoteStorageUrl(referenceSequence.getHtsPath().getURIString())){
Copy link
Contributor Author

@cmnbroad cmnbroad Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting to a String like this is exactly the kind of thing we want to avoid wherever possible. It would be much preferable to change isRemoteStorageURL to take a PicardHtsPath directly (or make isRemoteStorageURL a PicardHtsPath method), to discourage this kind of interconversion (see my comment elsewhere on String interconversion).

@@ -167,7 +170,7 @@ public SAMSequenceDictionary makeSequenceDictionary(final File referenceFile) {
private Iterable<SAMSequenceRecord> getSamSequenceRecordsIterable() {
return () -> {
final SequenceDictionaryUtils.SamSequenceRecordsIterator iterator =
new SequenceDictionaryUtils.SamSequenceRecordsIterator(REFERENCE_SEQUENCE,
new SequenceDictionaryUtils.SamSequenceRecordsIterator(referenceSequence.getHtsPath().toPath(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use 'referenceSequence.getReferencePath()' instead of duplicating the 'toPath()' call here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

URI = referenceSequence.getHtsPath().getURIString();
} else {
URI = "file:" + referenceSequence.getHtsPath().getURIString();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, this is the kind of string manipulation (and bug) that plagues the GATK code base - its really hard to preserve the structure of the URI when you do string manipulation like this. So we don't want to reproduce these patterns in Picard.

As it stands, this code will produce a URI string with a duplicate "file:" prefix for local files, since the protocol scheme is already present in the URI string returned by getURIString() (i.e., if the reference is a file called "myref.fasta", the value assigned to URI resulting from this will be "file:file:///path/to/myref.fasta").

Instead, you can just use:

Suggested change
}
URI = referenceSequence.getHtsPath().getURIString();

But then the whole enclosing if block (starting in line 187) becomes redundant and can be replaced with one line:

Suggested change
}
URI = referenceSequence.getHtsPath().getURIString();

There is no need to test whether the file is remote or not - PicardHtsPath is always backed by a valid URI that includes a protocol scheme. If the file is local, the protocol scheme will be "file://".

This is a great example of why PicardHtsPath and friends (the base classes) are better currency than File/Path/String, etc. The whole idea is to always use PicardHtsPath as the datatype wherever possible, including in function signatures, and then only interconvert to String, Path, or, in the worst case, File, when you absolutely have to, because you're calling some method that you have no control over. In this case, since the URI variable is a String, there is no getting around the conversion to String, but thats a terminal operation, and any intermediate processing should be done on the structured type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added a test case for URI, will look around for other places where I did this

/**
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath.
*/
public PicardHtsPath getREFERENCE_SEQUENCE() { return REFERENCE_SEQUENCE; }
Copy link
Contributor Author

@cmnbroad cmnbroad Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be eliminated, and this class and RequiredReferenceArgument both should override getHtsPath (I wish we had better names) to return REFERENCE_SEQUENCE directly. Ideally getHtsPath, and getReferencePath would then become the primary way for tools to obtain the reference (at least for nio-aware tools).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, sounds good. I think getHtsPath is a good name!

@takutosato
Copy link
Contributor

@cmnbroad made the changes you requested, back to you.

@lbergelson @droazen

Copy link
Contributor Author

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkpointing a partial review again due to time constraints.

There is also one other issue for this whole PR that we should think about, which is that we're using the concrete class HtsPicardPath everywhere, including in method signatures and return values, but really, we should only be using that when we need a concrete class (i.e., for things annotated with @Argument annotation, or when we need to construct a new instance of one of these objects), but everywhere else we should really just use the IOPath interface. That would make it easier to share code with GATK (though we haven't really done this in GATK either), or to migrate functionality to htsjdk. We might want to discuss this at the team meeting next week.

@Test
public void testExtensionConsistent(){
final String cloudPath = GATKBucketUtils.getTempFilePath(GCloudTestUtils.getTestInputPath(), "txt");
final String localPath = GATKBucketUtils.getTempFilePath("test", "txt");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, GSTKBucketUtils.getTempFilePath would return an HtsPicardPath (or HtsPath if it were to be moved to htsjdk).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were looking at old code here for some reason; this class is now called PicardBucketUtilsTest, and the getTempFilePath returns an object of type PicardHtsPath


// tsato: does it work with the System.getProperty() way? What if the gcloud path was specified as an environment variable?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as Defaults.REFERENCE_FASTA in htsjdk is of type File, I don't think this will work if a gcs path (or any URI with a non-file protocol scheme) is provided as an environment variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that sounds right. thanks.


/**
* @return The reference provided by the user, if any, or the default defined by {@code htsjdk.samtools.Defaults.REFERENCE_FASTA}
*/
@Override
public File getReferenceFile() {
return ReferenceArgumentCollection.getFileSafe(REFERENCE_SEQUENCE, log);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could inherit the javadoc from the base class, but if the javadoc is kept here I would suggest adding a including comment saying that this method is preserved for compatibility, and that getReferencePath and getHtsPath are preferred.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the javadoc here and moved the disclaimer about backward compatibility to the base class too.

}

/**
* @return The reference provided by the user, if any, or the default, if any, as an nio Path.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add "may be null".

Copy link
Contributor

@takutosato takutosato Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I delete this javadoc too since the base class has a better, more detailed javadoc?

*
* @return The reference provided by the user or the default as a File. May be null.
*/
File getReferenceFile(); // tsato: update tools that call this method to convert File to PicardHtsPath and use getHtsPath()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is going to be retained, I would suggest updating the javadoc to say that this method is retained for compatibility, and that getReferencePath and getHtsPath are preferred.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment to yourself on this line can probably be removed now.

}
if (OUTPUT == null) {
OUTPUT = ReferenceSequenceFileFactory.getDefaultDictionaryForReferenceSequence(referenceSequence.getReferenceFile());
final Path outputPath = ReferenceSequenceFileFactory.getDefaultDictionaryForReferenceSequence(referenceSequence.getHtsPath().toPath());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit of a nit, but this could be getReferencePath instead of getHtsPath.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

try {
Files.delete(OUTPUT.toPath());
} catch (IOException e2) {
throw new PicardException("Unknown problem encountered, and failed to delete the incomplete output.", e2);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a good idea to include the text of the first exception in the error message, since it describes the original problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, added e.getMessage() in the message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit of code is a little ugly….

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see you want e given to the exception, not e2

try (final BufferedWriter writer = Files.newBufferedWriter(METRICS_FILE.toPath())){
metricsFile.write(writer);
} catch (IOException e) {
throw new PicardException("Encountered an error writing the metrics file: " + METRICS_FILE.getURIString());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, retain the original exception (pass e to the PicardException constructor) since it describes the actual problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

*/
public static String getRequesterPaysBucket() {
return getSystemProperty("PICARD_REQUESTER_PAYS_BUCKET", REQUESTER_PAYS_BUCKET_DEFAULT);
return getSystemProperty("PICARD_REQUESTER_PAYS_BUCKET", REQUESTER_PAYS_BUCKET_DEFAULT.getURIString());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is unused, so it could be removed. But if we keep it, it could return an HtsPath.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. @lbergelson let us know if you object since this is your code...

if (append){
return new PicardHtsPath(path.getURIString() + newExtension);
} else {
return new PicardHtsPath(FilenameUtils.removeExtension(path.getURI().toString() + newExtension));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a handy method to have, but I think its a bit tricky to actually get this right in all cases - it will need a good suite of test cases.

As it stands, I think its adding the new extension first, then immediately removing it because the parentheses are wrong. You would need to remove the extension first, and THEN add on the new one:

Suggested change
return new PicardHtsPath(FilenameUtils.removeExtension(path.getURI().toString() + newExtension));
return new PicardHtsPath(FilenameUtils.removeExtension(path.getURI().toString()) + newExtension);

But also I'm a bit leary of relying on the apache FilenameUtils class, since I'm not sure that it handles all forms of the URI correctly.

I think the right way to do this is to use Path::getFileName to get just the file name part of the existing path, massage that to the new file name with the new extension, and then use Path::resolveSibling against the original Path to get the new Path, and then finally recreate a new HtsPath using the URI string resulting from the new path. Again, I think it may be a bit tricky to get right, but it would be super valuable to have.

Also, this could be an instance method on the original HtsPath rather than a static method (since the first arg is a PicardHtsPath anyway). This is also true of the resolve method below. If we can get it right, its a good candidate to move right into htsjdk on the HtsPath class. Maybe @lbergelson has thoughts on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. Now using resolveSibling. Will add some tests and see if FilnameUtils can be replaced by something in house. Let's discuss whether we should move this to htsjdk during next week's meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced FileNameUtils::removeExtension with a combination of IOPath::getExtension() and String::replaceAll(). Also added tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for moving this method to resolve, I think replaceExtension() belongs in PicardHtsPath because it returns a new PicardHtsPath object (much like PicardHtsPath::fromPath()). If it's in HtsPath we would have to override it, right?

public static final String GOOGLE_CLOUD_STORAGE_FILESYSTEM_SCHEME = "gs";
public static final String HTTP_FILESYSTEM_PROVIDER_SCHEME = "http";
public static final String HTTPS_FILESYSTEM_PROVIDER_SCHEME = "https";
public static final String HDFS_SCHEME = "hdfs";
public final static String FILE_SCHEME = "file";

// To the extent possible, avoid converting e.g. a PicardHtsPath to a string and calling beginsWith() against these
// string variables. Instead, work directly with PicardHtsPath/HtsPath using e.g. PicardHtsPath::getScheme()
private static final String GCS_PREFIX = "gs://";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad removed these string variables as discussed

@takutosato
Copy link
Contributor

@cmnbroad thanks again for your review. I made the changes you requested. I also took a stab at changing PicardHtsPath to IOPath where appropriate; these changes are in a separate (latest) commit.

Copy link
Contributor Author

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still haven't gotten all the way through, but there is a fair amount of cleanup to be done, mostly in extraneous javadoc comments (I commented where appropriate). Its converging, but slowly - I expect we'll need at least one more round, and maybe two before we can merge.

As for the use of IOPath, there are quite a few more places where that could be used (basically, everywhere EXCEPT where a new object is created (new operator, plus in @argument annotated variables, since those need require a concrete type). I kind of feel like we should do it everywhere possible, or not at all. I'm not sure how many more changes would be required if we try to do it everywhere, so perhaps we can discuss at the next meeting. I hate to not do it in this PR, but maybe its too much.

Also, is it intentional that we retained the hadoop code paths ? I fine either way, just wondering if we need those. Sometime soon we'll need similar support for Azure/AWS, but we might want to discuss whether we want retain the hadoop code paths.


/**
* Base interface for a reference argument collection.
*/
public interface ReferenceArgumentCollection {
/**
* @return The reference provided by the user, if any, or the default, if any.
* This method is retained for backward compatibility with legacy tools that have not been updated to support PicardHtsPath input files.
* The preferred methods for accessing the reference file in the command line argument is either getHtsPath() or getReferencePath().
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The preferred methods for accessing the reference file in the command line argument is either getHtsPath() or getReferencePath().
* The preferred methods for accessing the reference file provided on the command line is either getHtsPath() or getReferencePath().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change locally

*
* @return The reference provided by the user or the default as a File. May be null.
*/
File getReferenceFile(); // tsato: update tools that call this method to convert File to PicardHtsPath and use getHtsPath()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment to yourself on this line can probably be removed now.

Comment on lines 64 to 63
* We do not currently support setting a remote path via an environment variable. For this, we would have to update
* HtsJdk.Defaults.REFERENCE_FASTA to support remote paths.
*
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath. May be null.
*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* We do not currently support setting a remote path via an environment variable. For this, we would have to update
* HtsJdk.Defaults.REFERENCE_FASTA to support remote paths.
*
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath. May be null.
*/
* Return the reference provided by the user as a PicardHtsPath. This currently does not support remote paths set via HtsJdk.Defaults.REFERENCE_FASTA, since that uses a `File` object.
*
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath. May be null.
*/

Copy link
Contributor

@takutosato takutosato Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep this comment to myself as a TODO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 73 to 75
/**
* @return A "safe" way to obtain a File object for any reference path.
*
* For files that reside on a local file system, this returns a valid File object. Files that reside on
* a non-local file system can't be accessed via a File object, so return a placeholder File object that
* will defer failure when/if the file is actually accessed. This allows code paths that blindly propagate
* the value returned by calls to getReferenceFile to not get an NPE, and to fail gracefully downstream
* with an error message that includes the reference file specifier.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I wrote this, but a couple of proposed changes:

Suggested change
/**
* @return A "safe" way to obtain a File object for any reference path.
*
* For files that reside on a local file system, this returns a valid File object. Files that reside on
* a non-local file system can't be accessed via a File object, so return a placeholder File object that
* will defer failure when/if the file is actually accessed. This allows code paths that blindly propagate
* the value returned by calls to getReferenceFile to not get an NPE, and to fail gracefully downstream
* with an error message that includes the reference file specifier.
/**
* @return A "safe" File object for any reference path.
*
* For files that reside on a local file system, this returns a valid File object. Files that reside on
* a non-local file system can't be accessed via a File object, so return a placeholder File object that
* will defer failure until the the File object is actually accessed. This allows code paths that blindly propagate
* the value returned by calls to getReferenceFile to not get an NPE, and to fail gracefully downstream
* with an error message that includes the reference file specifier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incorporated


@Argument(shortName = StandardOptionDefinitions.REFERENCE_SHORT_NAME, doc = "Reference sequence file.", common = false)
public File REFERENCE_SEQUENCE;
@Argument(shortName = StandardOptionDefinitions.REFERENCE_SHORT_NAME, doc = "Reference sequence file.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, ok.

@@ -66,13 +64,12 @@ public static String getTempFilePath(String prefix, String extension){
* @param prefix The beginning of the file name
* @param suffix The end of the file name, e.g. ".tmp"
*/
private static String randomRemotePath(String stagingLocation, String prefix, String suffix) {
private static Path randomRemotePath(String stagingLocation, String prefix, String suffix) {
if (isGcsUrl(stagingLocation)) {
// Go through URI because Path.toString isn't guaranteed to include the "gs://" prefix.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this comment was already here, but I'm not sure what it refers to. I think its obsolete or something since it seems out or place (and possibly even wrong). I'd remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

GATKUtils.nonNull(path);
return path.startsWith(GCS_PREFIX);
return path.startsWith("gs://");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the GOOGLE_CLOUD_STORAGE_FILESYSTEM_SCHEME constant here (similar to the isHadoopUrl(String) one is written).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

new File(file.getAbsolutePath() + FileExtensions.TABIX_INDEX).deleteOnExit();
new File(file.getAbsolutePath() + ".bai").deleteOnExit();
new File(file.getAbsolutePath() + ".md5").deleteOnExit();
new File(file.getAbsolutePath().replaceAll(extension + "$", ".bai")).deleteOnExit();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the FileExtensions constants for these wherever there is on available (i.e., FileExtensions.BAI_INDEX).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I know its like this in GATK, but I'd make the same comment as elsewhere about putting the .bai variants adjacent to each other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

/**
*
* tsato: convert to string
*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this comment ? Obsolete ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 56 to 62
/**
* This default implementation is here to support tools that have not been upgraded to support cloud reference files.
* (The alternative is to make it abstract, which necessitates we update all the classes that implement this interface with
* an implementation like what we have below).
*
* The cloud-reference-enabled tools should override this method e.g. return REFERENCE_SEQUENCE (see DownsampleSam::makeReferenceArgumentCollection)
* Once all the relevant subclasses have been updated, we plan to make this abstract.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is somewhat out of place here and a bit confusing. Either way, I think the override cases are the exception, and should be (and I think are) documented where they occur. The javadoc here should really just describe the method itself. I would suggest the simplifying:

Suggested change
/**
* This default implementation is here to support tools that have not been upgraded to support cloud reference files.
* (The alternative is to make it abstract, which necessitates we update all the classes that implement this interface with
* an implementation like what we have below).
*
* The cloud-reference-enabled tools should override this method e.g. return REFERENCE_SEQUENCE (see DownsampleSam::makeReferenceArgumentCollection)
* Once all the relevant subclasses have been updated, we plan to make this abstract.
/**
* Returns a PicardHtsPath for the reference input. May be null. Implementers of this interface should override with an appropriate implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@takutosato
Copy link
Contributor

@cmnbroad I made the changes you requested and otherwise cleaned up comments and code. Back to you. Thanks again for reviewing.

@takutosato
Copy link
Contributor

Before this PR is merged, the accompanying htsjdk PR must be merged first. samtools/htsjdk#1681

Copy link
Contributor Author

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting closer, but still a few more issues to resolve (also, I finally had time this time to look at all the code, so hopefully this finally covers all of the issues).


/**
* @return The reference provided by the user, if any, or the default defined by {@code htsjdk.samtools.Defaults.REFERENCE_FASTA}
* @return The reference provided by the user, if any, or the default, if any, as an nio Path. May be null. (tsato: remove this javadoc since it's already present in the base class?)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd remove the javadoc from this class too, like you've done for the required one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* there is no guarantee that this will be used as the root of the tmp file name, a local prefix may be placed in the tmp folder for example
* @param extension and extension for the temporary file path, the resulting path will end in this
* @return a path to use as a temporary file, on remote file systems which don't support an atomic tmp file reservation a path is chosen with a long randomized name
* @param directory the directory where the temporary fill will be placed.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this method is mirroring what the GATK one does, but it has weird semantics in that this param is only used if it specifies a remote path. Given that, I would recommend including the language used in the GATK version of that that says that there is no guaranty that the file will actually be placed in this directory (or maybe, it seems like this is only used for remote files, so you could even be explicit and say that).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the behavior of prefix when creating a local tmp file was incorrect here. Code updated. Thanks for catching.

@@ -83,7 +78,7 @@ private static String randomRemotePath(String stagingLocation, String prefix, St
* on Spark because using the fat, shaded jar breaks the registration of the GCS FilesystemProvider.
* To transform other types of string URLs into Paths, use IOUtils.getPath instead.
*/
public static java.nio.file.Path getPathOnGcs(String gcsUrl) {
public static CloudStoragePath getPathOnGcs(String gcsUrl) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this can't be private ? It's currently only used within this file, and it would be great to prevent it from proliferating further.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -92,22 +87,24 @@ public static java.nio.file.Path getPathOnGcs(String gcsUrl) {
}

/**
* Note: to the extent possible, avoid converting an HtsPath object to a String then calling this method.
* Instead, use the same method below that takes in a PicardHtsPath as input.
*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I'm all for discouraging use of this method, and its private anyway (which is good), I would either remove this comment entirely or else clarify the reason for it with something that explains that we want to discourage use of this pattern of manipulating Strings. (Also, it no longer matches the signature of the actual method below). As it stands I don't think the comment will make sense to someone who doesn't already know the reason for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

I would like to document "best practices" for manipulating PicardHtsPath/IOPath objects --- e.g. avoid converting to a string then editing it…perhaps at the top of a some class as a comment. I will look for a good place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - would love to see the best practices documentation for this somewhere (maybe htsjdk, since that where the base implementation is). But I think thats a tall order, especially since we're still ironing those out via this code review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, sounds good. I will just keep the notes in a private file for now, and maybe we'll add it to the repo one day.

* @param append If set to true, append the new extension to the original basename. If false, replace the original extension
* with the new extension. If append = false and the original name has no extension, an exception will be thrown.
* @param newExtension A new file extension. Must include the leading dot e.g. ".txt", ".bam"
* @return A new PicardHtsPath object with the new extension
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mention that it only replaces the LAST component of an extension, i.e., given "my.fasta.gz, it will only replace the ".gz" part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

{"testdata/picard/sam/test.bam", ".bai", true, new File("testdata/picard/sam/test.bam.bai").getAbsolutePath()},
{"/Users/jdoe/workspace/picard/testdata/picard/sam/test.bam", ".bai", false, "/Users/jdoe/workspace/picard/testdata/picard/sam/test.bai"},
{"/Users/jdoe/workspace/picard/testdata/picard/sam/test.bam", ".md5", true, "/Users/jdoe/workspace/picard/testdata/picard/sam/test.bam.md5"}
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest some test cases with query params - see my comment on the PicardHtsPath changes.

} catch (IOException e2) {
throw new PicardException("Unknown problem encountered, and failed to delete the incomplete output.", e);
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall if we talked about this previously, but why is this second deletion attempt necessary ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary; I removed the first attempt, OUTPUT.toPath().toFile().delete();, which as you pointed out probably doesn't work on a remote path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks.

} catch (IllegalArgumentException e) {
// in case of an unexpected error delete the file so that there isn't a
// truncated result which might be valid yet wrong.
OUTPUT.delete();
OUTPUT.toPath().toFile().delete();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm - does OUTPUT.toPath().toFile() return a legitimate, useable File for a remote path ?

Comment on lines 126 to 129
* Takes an IOPath and returns a new PicardHtsPath object that keeps the same basename as the original but has
* a new extension. If the input path is a relative local path, it is converted to an absolute path via
* PicardHtsPath::toPath (we might change this behavior and avoid converting to an absolute path in the future,
* by for instance working directly with a string).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior described here is a bit of an implementation detail, and as you mention here, could be fixed by changing the implementation start with the rawInputString (I'm not suggesting we do that now - since I think as a practical matter this only impacts your tests because you want to compare the old and new paths). Anyway I'd change this comment slightly if you want to retain it:

Suggested change
* Takes an IOPath and returns a new PicardHtsPath object that keeps the same basename as the original but has
* a new extension. If the input path is a relative local path, it is converted to an absolute path via
* PicardHtsPath::toPath (we might change this behavior and avoid converting to an absolute path in the future,
* by for instance working directly with a string).
* Takes an IOPath and returns a new PicardHtsPath object that keeps the same basename as the original but has
* a new extension. If the input path was created from a rawInputString that specifies a relative local path, the new path will
* always have a rawInputString that specifies an absolute path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the javadoc is meant for the users of the function and not where the developers describe the implementation details. At the same time I think this is unnatural behavior, and the reason why it's implemented this way should be documented. I think here in the javadoc seems the most fitting place to put this information (the second best place might be the right before we return the output, because that's where this local to absolute path conversion happens under the hood).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that it this behavior is nonintuitive, and I'm fine with retaining this here, but I still suggest the change I proposed. In your text, the term "input path" is ambiguous (I interpret it to mean to the path used to create the original object). So I tried to fix that in my proposed text to eliminate the ambiguity by being a bit more explicit. At any rate, do whatever you think is best.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I incorporated your text.


public class PicardBucketUtilsTest {

// Check that the extension scheme (e.g. "txt" vs ".txt" as the second argument) is consistent for cloud and local files
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is obsolete since there is no longer a "txt" vs ".txt" issue (and also the reference to the "second argument" is a bit ambiguous given that there are two overloads being used here). Maybe:

Suggested change
// Check that the extension scheme (e.g. "txt" vs ".txt" as the second argument) is consistent for cloud and local files
// Check that the extension scheme is consistent for cloud and local files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@takutosato
Copy link
Contributor

@cmnbroad thanks again for your edits. I responded to all your comments. Back to you!

Copy link
Contributor Author

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're almost there- I left a couple of responses to Takuto's questions. This also has conflicts and needs a rebase.

@@ -92,22 +87,24 @@ public static java.nio.file.Path getPathOnGcs(String gcsUrl) {
}

/**
* Note: to the extent possible, avoid converting an HtsPath object to a String then calling this method.
* Instead, use the same method below that takes in a PicardHtsPath as input.
*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - would love to see the best practices documentation for this somewhere (maybe htsjdk, since that where the base implementation is). But I think thats a tall order, especially since we're still ironing those out via this code review.

Comment on lines 144 to 177
if (append){
return new PicardHtsPath(path.getURIString() + newExtension);
} else {
final Optional<String> oldExtension = path.getExtension();

if (oldExtension.isEmpty()){
throw new PicardException("The original path must have an extension when append = false: " + path.getURIString());
}

final String oldFileName = path.toPath().getFileName().toString();
final String newFileName = oldFileName.replaceAll(oldExtension.get() + "$", newExtension);
return PicardHtsPath.fromPath(path.toPath().resolveSibling(newFileName));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True that, but a file extension, and the file name part of a URI, are strings. So you're right that in those cases its unavoidable, but I think those are fine. My objection/obsession is with turning the entire URI into a string and then trying to parse/extend it.

As for question 1, I don't think its a matter of telling users to clean up the URIs; GCS, Azure, refget, DRS, etc., all use query parameters to support various features. If we don't properly support query parameters, I think we will have entirely missed the boat. And I don't actually think its that hard to do (though I know it probably feels a bit painful in this first PR!). My fear is if we don't get it right in this PR, it will really hard to retrofit.

The second question is a good one. I can certainly imagine cases where the user doesn't want to preserve the query parameters when changing the extension, but we can add methods to support that in a future PR without breaking backward compatibility. For now I would implement what you proposed, that is, preserve them (http://some/file.txt.log? http://some/file.txt.log?someParam=someValue).

* @param targetDir Directory in which to create the temp file. If null, the default temp directory is used.
* @return A file in the temporary directory starting with name, ending with extension, which will be deleted after the program exits.
*/
public static File createTempFileInDirectory(final String name, String extension, final File targetDir) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this can wait.

} catch (IOException e2) {
throw new PicardException("Unknown problem encountered, and failed to delete the incomplete output.", e);
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks.

Comment on lines 126 to 129
* Takes an IOPath and returns a new PicardHtsPath object that keeps the same basename as the original but has
* a new extension. If the input path is a relative local path, it is converted to an absolute path via
* PicardHtsPath::toPath (we might change this behavior and avoid converting to an absolute path in the future,
* by for instance working directly with a string).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that it this behavior is nonintuitive, and I'm fine with retaining this here, but I still suggest the change I proposed. In your text, the term "input path" is ambiguous (I interpret it to mean to the path used to create the original object). So I tried to fix that in my proposed text to eliminate the ambiguity by being a bit more explicit. At any rate, do whatever you think is best.


// Test cram (input/output)
// Note: this test is very slow---takes about 5 min on Broad internal network
final boolean testCram = true;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow - thats surprising. I was going to try to debug it, but I don't have you htsjdk changes. I would say we should retain the test (these run in separate cloud test job in CI anyway, right ?), and cleanup the code to remove the references to your local file system, remove the unused variables, etc.

@cmnbroad
Copy link
Contributor Author

@tsato Not sure where this stands (we might be waiting for each other ?). It seems like most things have either been addressed or dismissed. Unless you have more changes to make, the conflicts should probably be resolved and then I can take one final look (though I'll be out the next two weeks).

@takutosato
Copy link
Contributor

Hi @cmnbroad, sorry about the delay. It took me a while to test all the different cases for the cram performance issue that I was looking at with @lbergelson. Should be able to push my changes soon (hopefully tonight) so they should be ready when you return in two weeks.

@takutosato
Copy link
Contributor

@cmnbroad it's ready for your final review. One disappointment is that for the reasons I listed above in comment, I could not write tests for query params (http:// paths are currently not supported in Picard, and gcloud actively rejects URIs with a query parameter). But other than that I incorporated all the changes you suggested.

@takutosato
Copy link
Contributor

(Merge conflict is pretty minor so I will resolve it when it's ready to merge)

…mentCollection implementations to use nio paths.

Change CleanSam to use getReferencePath.

preliminary changes

test passing for now

CreateSequenceDictionary looks good

working on downsamplesam

temporary as I fix the issue

remove 2

just a commit

near ready

another checkpoint

checkpoint

all good except maybe the cram

ready

chris comments no 1

review edits 1

done proofreading

chris edit 2

PicardHtsPath -> IOPath where appropriate

another edit

another round of edits

louis

update htsjdk

chris suggestion replaceExtension

Co-authored-by: Chris Norman <[email protected]>

chris edit

reorganize

proofread

manually resolve conflicts

khalid conflict that was not caught

some fix

fixed more fails
Copy link
Contributor Author

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good now - down to a couple of minor requests (plus we still need to move the code Khalid added, and decide if we want to replace the CRAM tests with something that doesn't use a full reference).

Comment on lines 19 to 20
{"testdata/picard/sam/test.bam", ".bai", false, new File("testdata/picard/sam/test.bai").getAbsolutePath()}, // a relative input will be converted to absolute input
{"testdata/picard/sam/test.bam", ".bai", true, new File("testdata/picard/sam/test.bam.bai").getAbsolutePath()},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to have the REPLACE/APPEND aliases for true/false, you might as well use them everywhere. Otherwise it looks like something strange is going on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a good idea to add some test cases that have the replacement extension embedded in the pathname somewhere in addition to being at the end, to make sure the replacement only happens at the end, i.e.:
{"gs://hellbender/test/resources.fasta/hg19mini.fasta", ".fai", REPLACE, "gs://hellbender/test/resources.fasta/hg19mini.fai"}, {"gs://hellbender/test/resources.fasta/hg19mini.fasta", ".fai", APPEND, "gs://hellbender/test/resources.fasta/hg19mini.fasta.fai"},

as well as some cases where the input has no extension:

{"gs://hellbender/test/resources/hg19mini", ".fai", APPEND, "gs://hellbender/test/resources/hg19mini.fai"},

and also a new test for negative case(s?) that throw:

{"gs://hellbender/test/resources/hg19mini", ".fai", REPLACE},

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added these cases (and made the failure cases a separate test; let me know if there is a better way)

try {
Files.delete(OUTPUT.toPath());
} catch (IOException e2) {
throw new PicardException("Unknown problem encountered, and failed to delete the incomplete output.", e);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand why we need the second attempt at deletion here, but I thought you had fixed this to include the text of e2.getMessage().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I'm not sure why this change got lost, but I put it back in; removed the first delete attempt and added e2.message() to the string given by PicardException

@@ -71,7 +71,7 @@ public class CleanSam extends CommandLineProgram {
protected int doWork() {
IOUtil.assertFileIsReadable(INPUT);
IOUtil.assertFileIsWritable(OUTPUT);
final SamReaderFactory factory = SamReaderFactory.makeDefault().referenceSequence(REFERENCE_SEQUENCE);
final SamReaderFactory factory = SamReaderFactory.makeDefault().referenceSequence(referenceSequence.getReferencePath());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to change this for the reader, we should also do it for the writer (line 79). Right now it looks like two different references. I'd either revert this change, or make it consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to revert this change and cloudify the tool in a separate PR.

@@ -705,7 +705,7 @@ public void testUnmapBacterialContamination() throws IOException {

//yuck!
final RequiredReferenceArgumentCollection requiredReferenceArgumentCollection = new RequiredReferenceArgumentCollection();
requiredReferenceArgumentCollection.REFERENCE_SEQUENCE = reference;
requiredReferenceArgumentCollection.REFERENCE_SEQUENCE = new PicardHtsPath(reference.getAbsolutePath());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be bit more straightforward to just use reference.getHtsPath() here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you meant requiredReferenceArgumentCollection.REFERENCE_SEQUENCE = new PicardHtsPath(reference);?

@takutosato
Copy link
Contributor

Thanks again for your review @cmnbroad.

When using the full hg19 reference vs the chr20 + 21 reference for the cram test, the runtimes are actually roughly the same because in both cases we download the entire chromosome 20. So it doesn't seem to matter which one we use (I picked the 20 + 21 one).

Copy link
Contributor Author

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now! Approved (in spirit anyway - github won't let me merge it because its not marked as approved, and it won't let me mark it as approved, I assume because it was originally my branch and I authored some of the commits).

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGT@cmnbroad

@cmnbroad cmnbroad merged commit 1abb61f into master Feb 5, 2024
6 checks passed
@cmnbroad cmnbroad deleted the cn_nio_reference_files branch February 5, 2024 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants