There are a few terms used in DataChain which I think we need to define more clearly and consistently, especially in docs and blog posts (or maybe I just missed the fact that we have this nomenclature already :-))
DataChain:
the product name and the python library :-)
But also the object we work with in DataChain, basically what elsewhere is usually called a dataframe or a table ... maybe we can call these DataChain tables?
Dataset:
A dataset is a persisted DataChain? Should we call it a DataChain dataset? I would probably just say a persisted datachain or a persisted table (if we call the instances of DataChain class tables)
Column vs signal:
we have tables with hierarchical columns and we sometimes call them columns and sometimes signals
I would just use the word columns everywhere and forget about signals, because everyone knows what table columns are, and signals are more vague.
We need to be able to clearly differentiate between a column like file, which contains a collection of lower-level columns (file.path, file.version, or even file.foo.bar), and single-level columns (e.g. ones created by the users). Pandas has a similar concept for the index, which it calls a MultiIndex (or a hierarchical index). So we could then perhaps call this a multicolumn vs a column?
but then we also have the DataModel class which basically corresponds to a group of columns or a subschema (specifying a collection of column names and their types) ... also the built-in File class is used that way
So what should we call the instances of DataModel (and File)? If we used MultiColumn instead of DataModel (that would mean renaming it which is a bit annoying ... and I know it is not technically a column, but from the user perspective that's how you work with it) then we could just call those all multicolumns (even if there is some ambiguity whether we mean the actual columns or the instance of this class) and we could call File instances something like "built-in" multicolumns.
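To make the "group of columns" idea concrete, here is a minimal sketch using only standard-library dataclasses. It does not use DataChain's real `DataModel` or `File` classes; the `File` and `Foo` classes and the `leaf_columns` helper below are hypothetical stand-ins, only meant to show how one nested "multicolumn" like file expands into the single-level columns file.path, file.version and file.foo.bar.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Foo:
    # Hypothetical nested group, used only to get a second level (file.foo.bar).
    bar: int


@dataclass
class File:
    # Hypothetical stand-in, NOT DataChain's actual File class.
    path: str
    version: str
    foo: Foo


def leaf_columns(model, prefix):
    """Map dotted leaf column names to their types for a nested dataclass type."""
    columns = {}
    for f in fields(model):
        name = f"{prefix}.{f.name}"
        if is_dataclass(f.type):
            # A nested group of columns: recurse one level down.
            columns.update(leaf_columns(f.type, name))
        else:
            # A single-level (leaf) column.
            columns[name] = f.type
    return columns


# "file" is the multicolumn; the keys of `schema` are its leaf columns.
schema = leaf_columns(File, "file")
for column_name, column_type in schema.items():
    print(column_name, column_type.__name__)
```

In this picture, the class (`File`) plays the role of the subschema, the name file is what the user sees as one hierarchical column, and the dotted names are the ordinary columns underneath it, which is the distinction the naming needs to capture.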
DataChain:
I like DataChain tables. Certainly on first pass in docs or a presentation, to signify that they are special and distinct from dataframes or other tables.
Dataset
I like DataChain dataset and persisted datachain: they get the term used more, which solidifies it in the user's mind.
Column vs. signal
I'm not sure about this one. I need to actually work with it to understand it better. My first impression is that, unlike MultiIndex, multicolumn does not really indicate anything different from the plural of column. Index carries with it the idea of a layer of some kind, so MultiIndex implies something greater is happening.