Tensor to NumpyArray and back requires copy #8
No need for a benchmark, the copies are explicit in the conversion code, and a second copy can happen if the Tensor is not contiguous.

Why is a copy performed? Because having two objects that refer to the same memory is the best way to end up with dangling pointers, and for me, implementing a non-robust solution that I know will crash is bad design. The goal of this was to potentially extend the Arraymancer API through Numpy; since Python is usually quite slow anyway, optimising for performance wasn't the top priority.

Solutions? PRs are welcome if you want to give it a try, I don't really have the time to work on it these days.
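To make the hazard concrete, here is a minimal sketch (assuming the zero-copy fromBuffer view discussed below; makeView is a hypothetical proc) of how a shared view can end up dangling:

import nimpy
import arraymancer
import scinim/numpyarrays

let np = pyImport("numpy")

proc makeView(): Tensor[int64] =
  # The numpy array is only referenced by this local PyObject; once the
  # proc returns, nothing on the Nim side keeps it alive.
  var localSeq = newSeq[int64](1000)
  var pyObj = np.array(localSeq)
  var nd = pyObj.asNumpyArray[:int64]
  # Zero-copy Tensor view over memory owned by the Python object.
  result = fromBuffer(nd.data, nd.shape)

let t = makeView()
# If Python has released the array by now, `t` is a dangling view:
# reading or writing it is undefined behaviour.
echo t[0]

Whether this actually crashes depends on when CPython reuses the memory, which is exactly why it is hard to make robust.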
Turning a Numpy array into a Tensor without a copy already seems possible with fromBuffer:
import std/dynlib
import nimpy
import nimpy/py_lib
import arraymancer
import scinim/numpyarrays

block:
  type TestType = int64
  const testSize = 1000

  var originalSeq = newSeq[TestType](testSize)
  # total heap usage: 1 allocs, 1 frees, 8,008 bytes allocated
  # All heap blocks were freed -- no leaks are possible

  let np = pyImport("numpy")
  # + 34,790,614
  # total heap usage: 10,577 allocs, 7,255 frees, 34,798,622 bytes allocated
  # definitely lost: 192 bytes in 2 blocks

  var pyObj = np.array(originalSeq)
  # + 16,000
  # total heap usage: 10,579 allocs, 7,257 frees, 34,814,622 bytes allocated
  # definitely lost: 192 bytes in 2 blocks

  # Create a Tensor with fromBuffer (zero-copy view over the numpy data)
  var pyNd = pyObj.asNumpyArray[:TestType]
  var tensor = fromBuffer(pyNd.data, pyNd.shape)
  # + 308 (!)
  # total heap usage: 10,584 allocs, 7,262 frees, 34,814,930 bytes allocated
  # definitely lost: 192 bytes in 2 blocks

  # Test r/w: writes through the numpy array are visible in the Tensor
  for i in 0..<testSize:
    assert pyObj[i].to(TestType) == tensor[i]
    pyObj[i] = i mod 5
    assert tensor[i] == i mod 5
  # + 0
  # total heap usage: 10,584 allocs, 7,262 frees, 34,814,930 bytes allocated
  # definitely lost: 192 bytes in 2 blocks

  {.pragma: pyfunc, cdecl, gcsafe.}
  let Py_FinalizeEx = cast[proc(): int {.pyfunc.}](py_lib.pyLib.module.symAddr("Py_FinalizeEx"))
  assert Py_FinalizeEx != nil
  assert Py_FinalizeEx() == 0
  # + 19,474
  # total heap usage: 10,596 allocs, 9,567 frees, 34,834,404 bytes allocated
  # definitely lost: 192 bytes in 2 blocks

config.nims:

--gc: "arc"
--d: "release"
--opt: "speed"
--d: "useMalloc"

I've also wrapped some of the Numpy C-API, but I've not found any real advantage over the already implemented Buffer Protocol. I'll check whether there's any function there that helps moving from Tensor to Numpy; the only sensible way I've found so far seems quite convoluted: https://stackoverflow.com/a/2925014/17274026
Yes, but that's not the limiting factor. The issue is making sure the memory is not freed / moved / resized by Python; otherwise your Tensor will point to invalid memory. Using the C functions Py_IncRef and Py_DecRef (https://github.com/yglukhov/nimpy/blob/master/nimpy/py_utils.nim#L9-L16) should allow you to bind a C array while keeping the owning Python object alive. Note that if the PyBuffer API already allows you to do this, then you won't need the C API of Numpy. See also questions like https://stackoverflow.com/questions/52731884/pyarray-simplenewfromdata
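A sketch of that idea, with the caveat that instead of calling Py_IncRef directly it relies on nimpy's PyObject wrapper keeping a reference for as long as it is reachable from Nim (SharedTensor and toSharedTensor are hypothetical names):

import nimpy
import arraymancer
import scinim/numpyarrays

type
  SharedTensor[T] = object
    ## Bundles a zero-copy Tensor view with the PyObject that owns the
    ## buffer. While `owner` is reachable, nimpy holds a reference to the
    ## numpy array, so CPython will not free the underlying memory
    ## (resizing from the Python side is still not protected against).
    owner: PyObject
    view: Tensor[T]

proc toSharedTensor[T](pyObj: PyObject): SharedTensor[T] =
  var nd = pyObj.asNumpyArray[:T]
  result.owner = pyObj
  result.view = fromBuffer(nd.data, nd.shape)

when isMainModule:
  let np = pyImport("numpy")
  var arr = np.array(@[1'i64, 2, 3, 4])
  let shared = toSharedTensor[int64](arr)
  assert shared.view[2] == 3

The point is simply that the owner and the view must have tied lifetimes; whether that is enforced with IncRef/DecRef, a wrapper object like this, or the buffer protocol is secondary.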
Are you sure it's leaked memory and not just memory allocated once for the lifetime of the program, and therefore not freed because the OS will reclaim it anyway? (Note: I'm not a fan of this way of doing things either, but it's not wrong.) Doing multiple conversions and checking whether the reported leak grows with the number of iterations should tell you.
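For example, a sketch of such a check (assuming that a real per-conversion leak would grow with the iteration count):

import nimpy
import arraymancer
import scinim/numpyarrays

let np = pyImport("numpy")
var data = newSeq[int64](1000)

# If the 192 "definitely lost" bytes come from interpreter start-up they stay
# constant under valgrind; if each conversion leaks, the figure grows with
# the iteration count.
for _ in 0 ..< 100:
  var pyObj = np.array(data)
  var nd = pyObj.asNumpyArray[:int64]
  var tensor = fromBuffer(nd.data, nd.shape)
  doAssert tensor[0] == 0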
I might have a better solution, but it adds a Python dependency: Apache Arrow. I'm using it to build a pyarrow.Buffer object (an object compatible with the Python buffer protocol) from a seq/Tensor using pyarrow.foreign_buffer; see https://arrow.apache.org/docs/python/memory.html

This approach has a double advantage: Arrow is not only a bridge to numpy but a memory representation model, a lingua franca. For example, it could give us zero-copy integration with other Arrow-aware libraries. Moreover, as I've experimented, the Apache Arrow C API (GObject-based) is already easily wrapped with futhark, and the underlying Arrow C Data Interface aims to be a stable ABI.

Here's an example, with the overhead of each step annotated in bytes:

import std/[
  dynlib,
  sequtils
]
import nimpy
import nimpy/py_lib
import arraymancer

proc Py_FinalizeEx =
  {.pragma: pyfunc, cdecl, gcsafe.}
  let aux = cast[proc(): int {.pyfunc.}](py_lib.pyLib.module.symAddr("Py_FinalizeEx"))
  assert aux != nil
  assert aux() == 0

proc test(T: typedesc, testSize: int) =
  var allocSeq = toSeq(0.T..testSize.T)
  # data 8000 + overhead 8
  # total heap usage: 1 allocs, 1 frees, 8,008 bytes allocated
  # All heap blocks were freed -- no leaks are possible

  var zcTensor = fromBuffer(cast[ptr UncheckedArray[T]](allocSeq[0].addr), allocSeq.len)
  # overhead 32
  # total heap usage: 2 allocs, 2 frees, 8,040 bytes allocated
  # All heap blocks were freed -- no leaks are possible

  for i in 0..<testSize:
    assert allocSeq[i] == zcTensor[i]

  block:
    let
      pa = pyImport("pyarrow")
      np = pyImport("numpy")
      sys = pyImport("sys")
      gc = pyImport("gc")
    # overhead 36,431,083 (python initialization)
    # total heap usage: 14,392 allocs, 10,596 frees, 36,439,123 bytes allocated

    # https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html
    let paBuffer = pa.foreign_buffer(cast[int](zcTensor.asContiguous(rowMajor).toUnsafeView), T.sizeof * allocSeq.len)
    # overhead 232
    # total heap usage: 14,398 allocs, 10,602 frees, 36,439,355 bytes allocated
    # echo sys.getrefcount(paBuffer) -> 2

    block:
      # https://numpy.org/doc/stable/reference/generated/numpy.frombuffer.html
      let npDtype = np.getAttr($T)
      let npArray = np.callMethod("frombuffer", paBuffer, npDtype)
      # overhead 0
      # total heap usage: 14,398 allocs, 10,602 frees, 36,439,355 bytes allocated
      # echo sys.getrefcount(paBuffer) -> 3

      for i in 0..<testSize:
        assert zcTensor[i] == npArray[i].to(T)

    discard gc.collect()
    # overhead 0
    # total heap usage: 14,398 allocs, 10,602 frees, 36,439,355 bytes allocated
    # npArray has been deallocated
    # echo sys.getrefcount(paBuffer) -> 2

    Py_FinalizeEx()
    # overhead 18,488
    # total heap usage: 14,409 allocs, 13,095 frees, 36,457,843 bytes allocated

    # python interpreter is closed, original buffer is still in place
    for i in 0..<testSize:
      assert allocSeq[i] == zcTensor[i]
    # overhead 0
    # total heap usage: 14,409 allocs, 13,095 frees, 36,457,843 bytes allocated

test(int64, 1000)
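Wrapped up as a reusable helper, this could look roughly like the sketch below (toNumpyViaArrow is a hypothetical name; it assumes the input Tensor is already contiguous and row-major, and that the caller keeps it alive for as long as the returned numpy array is used):

import nimpy
import arraymancer

let pa = pyImport("pyarrow")
let np = pyImport("numpy")

proc toNumpyViaArrow[T](t: Tensor[T]): PyObject =
  ## Exposes the Tensor's buffer to numpy through a pyarrow.Buffer,
  ## without copying the data. `t` must be contiguous and row-major, and
  ## must stay alive (and unmoved) while the returned numpy array is used.
  let buf = pa.foreign_buffer(cast[int](t.toUnsafeView), T.sizeof * t.size)
  result = np.callMethod("frombuffer", buf, np.getAttr($T))

when isMainModule:
  var t = [1'i64, 2, 3, 4].toTensor
  let arr = toNumpyViaArrow(t)
  assert arr[2].to(int64) == 3

The asContiguous call from the snippet above is deliberately left out here: if it had to copy, the copy would be owned by a local Tensor and could be freed while numpy still points at it.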
I've been doing some memory profiling with valgrind.
It seems that converting a Tensor to a NumpyArray and back requires a copy for every kind of conversion, and this makes NumpyArray costly to use.
Is there a solution for this?
config.nims
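Presumably the same flags as in the comment above; in particular useMalloc is what makes Nim's heap allocations visible to valgrind:

--gc: "arc"
--d: "release"
--opt: "speed"
--d: "useMalloc"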
The results under each line are from a different compilation and run.
They show that an allocation the size of the original Tensor is required on each transform from Tensor to NumpyArray and vice versa.