Should datashape names/types be mutable/settable? #180

Open
dan-coates opened this issue Sep 15, 2015 · 3 comments

Comments

@dan-coates
Contributor

The current mechanism for changing anything in a datashape seems to be building a new dshape string with the needed changes and creating a new datashape from it. Most of the attributes on important datashape objects like DataShape and Record are defined as properties, which cannot be altered. One could try to modify the _parameters attribute directly, but that's a pretty ugly hack, and _parameters is often a tuple of tuples, which is immutable anyway.

With fairly large datashapes where one needs to alter one name or type in a record, it would be much easier to have a method or other way to change the name/type in place rather than having to construct an entirely new dshape string to build a new datashape.

This feels like it should be doable, but I'm not sure if allowing Record names/types to be altered has some potential downsides from an architectural perspective that outweigh the benefits.

@llllllllll
Member

I think that most things assume that dshapes are immutable. Maybe we could implement something like namedtuple._replace, which returns a copy with the changed values.

Also note that I don't use string formatting to change fields. For example, with a record, you can say:

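from collections import OrderedDict
from datashape import Record, int32

od = OrderedDict(some_record.fields)  # copy the (name, type) pairs
od[some_field] = int32                # override the type of one field
Record(od)                            # build a new Record; the original is untouched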
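
For reference, the namedtuple._replace behaviour mentioned above can be sketched with the standard library alone. This is only an illustration of the copy-with-changes pattern: the Field type below is a hypothetical stand-in, not a datashape class, and nothing here is datashape's API.

```python
from collections import namedtuple

# Hypothetical stand-in for an immutable, hashable record type.
# This is NOT datashape's Record; it only illustrates the pattern.
Field = namedtuple("Field", ["name", "type"])

original = Field(name="amount", type="float64")

# _replace returns a new tuple with the given fields changed;
# the original instance is left as it was.
updated = original._replace(type="int32")

print(original)  # Field(name='amount', type='float64')
print(updated)   # Field(name='amount', type='int32')

# Both values stay hashable, so either can be used as a dict key.
cache = {original: "old", updated: "new"}
```

Because the copy is a separate value, anything that hashed or cached the original keeps working.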

@cpcloud
Member

cpcloud commented Oct 5, 2015

Immutability is very important for datashape because we depend on dshapes being hashable in blaze.

I agree that a top-level swap or replace(dshape, old_sub_dshape, new_sub_dshape) function would be useful.
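
A rough sketch of what such a replace function could look like, assuming a simplified model in which a shape is a nested tuple of (name, type) pairs. The function name comes from the comment above, but the body, the tuple model, and the sample data are assumptions for illustration and do not use datashape itself.

```python
def replace(shape, old, new):
    """Return a copy of `shape` with every sub-shape equal to `old`
    swapped for `new`. Nothing is mutated; a new tuple is built."""
    if shape == old:
        return new
    if isinstance(shape, tuple):
        return tuple(replace(item, old, new) for item in shape)
    return shape


# Hypothetical nested record modelled as tuples of (name, type) pairs.
record = (
    ("id", "int64"),
    ("price", "float64"),
    ("tags", (("label", "string"),)),
)

changed = replace(record, ("price", "float64"), ("price", "int32"))

# The result is still a plain nested tuple, so it remains hashable.
lookup = {record: "before", changed: "after"}
```

Returning a new value instead of editing in place keeps the hashability that blaze relies on.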

OTOH @octophat would an argument like typehints={'field_name': 'int64'} cover the use case you're thinking of?

@cpcloud cpcloud assigned cpcloud and unassigned cpcloud Oct 5, 2015
@cpcloud cpcloud added this to the 0.4.8 milestone Oct 5, 2015
@dan-coates
Contributor Author

Actually, the main place I've needed this so far is in changing the name of a field rather than its type. I'm doing this to avoid reserved words as column names in Teradata. It's a very manual process right now and is actually leading to some bugs, as I don't think we always construct the new dshape correctly (obviously this is on my crappy code and not datashape; I'm just pointing out how having to handle this manually can lead to bugs).

I think the typehints argument would work well for fields where you know ahead of time that you want to change the type, but it wouldn't cover my name-changing use case. It also wouldn't work well if you want to run discover, eyeball the datashape, and then change something (it could work, but it would involve recreating the datashape, which seems inefficient). Being able to modify a datashape in place or without rescanning the source data would be ideal.
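
The renaming case could be handled without rescanning the source by working on the list of (name, type) pairs directly. Below is a minimal sketch under that assumption: rename_fields is a hypothetical helper, the field list is made-up sample data, and types are plain strings rather than datashape objects.

```python
def rename_fields(fields, renames):
    """Return a new list of (name, type) pairs with names mapped
    through `renames`. Field order and types are preserved."""
    renamed = [(renames.get(name, name), typ) for name, typ in fields]
    names = [name for name, _ in renamed]
    if len(set(names)) != len(names):
        raise ValueError("renaming produced duplicate field names")
    return renamed


# Made-up sample fields; "date" stands in for a reserved word.
fields = [("id", "int64"), ("date", "string"), ("amount", "float64")]

safe = rename_fields(fields, {"date": "date_"})
print(safe)
# [('id', 'int64'), ('date_', 'string'), ('amount', 'float64')]
```

The resulting pairs could then be fed to a record constructor in the same way as the OrderedDict example earlier in this thread; that step is left out here because it depends on the installed datashape version.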
