Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37938: [Swift] Add initial C data interface implementation #41342

Merged
merged 1 commit into from
May 29, 2024

Conversation

abandy
Copy link
Contributor

@abandy abandy commented Apr 23, 2024

Continuation for PR: #39091

This add an initial implementation of the C Data interface for swift. During development it was found that null count was not being properly maintained on the arrow buffers and this change is included as well. Also some minor refactoring was done to existing sources to enable this feature.

This has been tested from Swift calling into C to import data but not from Swift to C exporting data. Test is currently ongoing.

Copy link

⚠️ GitHub issue #37938 has been automatically assigned in GitHub to PR creator.

@abandy
Copy link
Contributor Author

abandy commented Apr 23, 2024

The go test for the C data interface are currently not included in CI testing. I plan to update this in a future PR.

@dbtsai
Copy link
Member

dbtsai commented Apr 23, 2024

It’s great to add those interfaces to add Swift support.

Cc @andygrove @viirya

Thanks!

@@ -40,4 +40,4 @@ popd
source_dir=${1}/swift/ArrowFlight
pushd ${source_dir}
swift test
popd
popd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary change

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 23, 2024
@abandy abandy force-pushed the GH-37938 branch 2 times, most recently from 739c452 to c269261 Compare April 24, 2024 00:11
@abandy
Copy link
Contributor Author

abandy commented Apr 24, 2024

It’s great to add those interfaces to add Swift support.

Cc @andygrove @viirya

Thanks!

Np, thank you!

@kou
Copy link
Member

kou commented Apr 24, 2024

@dbtsai Are you familiar with Swift? If so, could you also review this carefully?

Comment on lines 49 to 50
name: "ArrowC", // your C/C++ library's name
path: "Sources/ArrowC" // your path to the C/C++ library
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments look like from some examples and maybe unrelated here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I will remove.

Comment on lines +114 to +117
cArray.n_children = 0
cArray.children = nil
cArray.dictionary = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is this for TODO later? Otherwise I think n_children should not be zero always? And also for children and dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Swift version of Arrow currently doesn't support having children (Struct type) or dictionaries. I will add a comment mentioning this so it can be updated when those constructs are added.

let arraySchema = data!.pointee
let exportId = Int(bitPattern: arraySchema.private_data)
guard ArrowCExporter.exportedData[exportId] != nil else {
fatalError("Export schema not found with id \(exportId)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does fatalError actually do something like C exit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the behavior is the same. Description from Swift docs for fatalError: Unconditionally prints a given message and stops execution.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review and removed awaiting committer review Awaiting committer review labels Apr 24, 2024
@github-actions github-actions bot added awaiting review Awaiting review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Apr 24, 2024
@abandy
Copy link
Contributor Author

abandy commented Apr 28, 2024

@kou and @pitrou Please review when you get a chance. This is very similar to the previous PR(#39091) for this enhancement that was closed but with the updates from @pitrou previous comments. Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add this file to swiftlint targets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added ArrowFlight and CDataWGo Package files as well, thanks!

@abandy
Copy link
Contributor Author

abandy commented May 15, 2024

@kou I have made the requested changes. Please review again.

@pitrou
Copy link
Member

pitrou commented May 16, 2024

@abandy Could you please rebase from git main, if not recently done, so that we get cleaner CI results?

}
}

private static var exportedData = [Int: ExportData]()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this container cleaned up at shutdown? I'm wondering what happens if a release callback is called after this container is cleaned up. Presumably the release callback would crash because of the fatalError?

This is not necessarily a problem, but I'm curious :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the memory should be released during shutdown. The container should be valid in the process that loaded the library until the process has been shutdown.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my question was more along the lines of: what happens if the release callback is called after finalization of Swift internals has started? Granted, we might not be able to do much about it, but it's probably good to keep this in mind (and/or add a comment about it).

self.arrowType = arrowType
// keeping the name str to ensure the cstring buffer remains valid
self.name = name
self.arrowTypeName = (try arrowType.cDataFormatId as NSString).utf8String!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only ok because arrowType.cDataFormatId always returns a static string, right?
Otherwise you would also have to store arrowType.cDataFormatId in ExportSchema so that its buffer remains valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I suppose that's not very future-proof, then? Some types such as timestamp, union... have dynamic format strings which embody the type's parameters. You may want to add a comment to remind readers about this?

(or perhaps simply arrange to keep the arrowType.cDataFormatId result alive in the ExportSchema class)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a PR ready for Struct types and plan to follow up with Unions afterward. I was going to address this issue when I added support for these types to the C data interface in a future PR but I will make this change now so it is one less thing that will be an issue at that time. Thank you!

Comment on lines 74 to 78
return .failure(.invalid("Children currently not supported"))
} else if cSchema.dictionary != nil {
return .failure(.invalid("Dictinoary types currently not supported"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also call ArrowCImporter.release(cSchema) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, agreed, I will update.

case .success(let holder):
return .success(ImportArrayHolder(holder, cArrayPtr: cArrayPtr))
case .failure(let err):
return .failure(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also call ArrowCImporter.release here?

@abandy
Copy link
Contributor Author

abandy commented May 23, 2024

@pitrou and @kou I have made the requested changes. Please review again. I have 2 PRs waiting for some of theses changes (one is for adding struct type) and it would be great if we could get this merged so I can submit them. Thank you!

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, smaller PRs are easy to review/merge than one large PR.
If you want to proceed quickly as much as possible, could you use the smaller PRs approach next time like you will do for struct type?
For example, we might be able to choose the following approach for this case:

  1. Implement ArrowSchema export for 1 simple type such as bool.
  2. Implement ArrowSchema import for 1.
  3. Implement ArrowSchema export/import for one more simple type such as int8.
  4. ...

FYI: MATLAB uses the small PRs approach: https://github.com/apache/arrow/issues?q=is%3Aissue+matlab+%22C+data+interface%22

@@ -16,10 +16,15 @@
# under the License.

included:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 24, 2024
@abandy
Copy link
Contributor Author

abandy commented May 24, 2024

In general, smaller PRs are easy to review/merge than one large PR. If you want to proceed quickly as much as possible, could you use the smaller PRs approach next time like you will do for struct type? For example, we might be able to choose the following approach for this case:

  1. Implement ArrowSchema export for 1 simple type such as bool.
  2. Implement ArrowSchema import for 1.
  3. Implement ArrowSchema export/import for one more simple type such as int8.
  4. ...

FYI: MATLAB uses the small PRs approach: https://github.com/apache/arrow/issues?q=is%3Aissue+matlab+%22C+data+interface%22

Hi @kou, thank you example on the smaller project. I will definitely follow that format for the struct type change! Thank you very much!

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 24, 2024
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abandy This LGTM in general, I've just posted two comments about potential corner cases. I'll let @kou give the final go.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

I added some minor comments but I'll merge this. If we need to work on these comments, let's work on them as follow-up tasks.

#define ARROW_FLAG_NULLABLE 2
#define ARROW_FLAG_MAP_KEYS_SORTED 4

#include <stdio.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this?

#define ARROW_FLAG_MAP_KEYS_SORTED 4

#include <stdio.h>
#include <stdlib.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to ArrowCData.c?


#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> // Must have this!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <stdint.h> // Must have this!
#include <stdint.h> // For int64_t

require (
github.com/andybalholm/brotli v1.0.5 // indirect
github.com/apache/arrow/go/v12 v12.0.1 // indirect
github.com/apache/arrow/go/v16 v16.0.0-20240203105949-22f2cfd1e1eb // indirect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really indirect dependency?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels May 29, 2024
@kou kou changed the title GH-37938: [Swift] initial impl of C Data interface GH-37938: [Swift] Add initial C data interface implementation May 29, 2024
@kou kou merged commit 4d524eb into apache:main May 29, 2024
53 of 55 checks passed
@kou kou removed the awaiting merge Awaiting merge label May 29, 2024
@abandy
Copy link
Contributor Author

abandy commented May 29, 2024

+1

I added some minor comments but I'll merge this. If we need to work on these comments, let's work on them as follow-up tasks.

Thank you @kou, @pitrou, @dbtsai, @viirya for the reviews!!

I will create a MINOR pr for the comments added by @kou.

kou pushed a commit that referenced this pull request May 29, 2024
### Rationale for this change

Follow up changes requested in the #41342

### What changes are included in this PR?

Update includes header file and go dependencies changes related to the C Data interface changes.

Authored-by: Alva Bandy <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 4d524eb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

@abandy abandy deleted the GH-37938 branch June 10, 2024 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants