DataChain objects nomenclature #534

Open
tibor-mach opened this issue Oct 22, 2024 · 1 comment
Labels
documentation Improvements or additions to documentation

Comments

@tibor-mach
Contributor

There are a few terms used in DataChain which I think we need to define more clearly and consistently, especially in the docs and any blog posts etc. (or maybe I just missed the fact that we have this nomenclature already :-))

DataChain:

  • the product name and the python library :-)
  • But also the object we work with in DataChain, basically what elsewhere is usually called a dataframe or a table ... maybe we can call these DataChain tables?

Dataset :

  • A dataset is a persisted DataChain? Should we call it a DataChain dataset? I would probably just say a persisted datachain or a persisted table (if we call the instances of DataChain class tables)

Column vs signal:

  • we have tables with hierarchical columns and we sometimes call them columns and sometimes signals
  • I would just use the word columns everywhere and forget about signals, because everyone knows what table columns are, while signals are more vague.
  • We need to be able to clearly differentiate between a column like file, which contains a collection of lower-level columns (file.path, file.version, or even file.foo.bar), and single-level columns (e.g. ones created by the users). Pandas has a similar concept for the index, which it calls a MultiIndex (or a hierarchical index). So we could then perhaps call this a multicolumn vs a column?
  • but then we also have the DataModel class which basically corresponds to a group of columns or a subschema (specifying a collection of column names and their types) ... also the built-in File class is used that way
  • So what should we call the instances of DataModel (and File)? If we used MultiColumn instead of DataModel (that would mean renaming it which is a bit annoying ... and I know it is not technically a column, but from the user perspective that's how you work with it) then we could just call those all multicolumns (even if there is some ambiguity whether we mean the actual columns or the instance of this class) and we could call File instances something like "built-in" multicolumns.
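To make the column vs multicolumn distinction above concrete, here is a minimal sketch in plain Python. It deliberately does not use the DataChain API: `FileLike`, `Foo`, `Row` and `leaf_columns` are hypothetical names invented for illustration, with standard-library dataclasses standing in for a grouped column such as file. It only shows how one top-level name can expand into several dotted lower-level column names while a user-created column stays single-level.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Foo:
    # Hypothetical nested group, standing in for something like file.foo
    bar: str


@dataclass
class FileLike:
    # Hypothetical stand-in for a grouped column such as `file`.
    # This is NOT DataChain's File class, only an illustration.
    path: str
    version: str
    foo: Foo


@dataclass
class Row:
    file: FileLike  # multi-level column: a group of lower-level columns
    label: int      # single-level column, e.g. one created by a user


def leaf_columns(cls, prefix=""):
    """Return the dotted names of all leaf columns under a dataclass type."""
    names = []
    for f in fields(cls):
        name = f"{prefix}{f.name}"
        if is_dataclass(f.type):
            # A grouped column: recurse into its lower-level columns.
            names.extend(leaf_columns(f.type, prefix=name + "."))
        else:
            names.append(name)
    return names


top_level = [f.name for f in fields(Row)]
print(top_level)          # the names a user sees first
print(leaf_columns(Row))  # the fully expanded dotted names
```

In this sketch `file` is what the proposal would call a multicolumn and `label` a plain column; whichever term is chosen needs to cover both the top-level name and its dotted expansion.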
@jendefig

DataChain:
I like DataChain tables. Certainly on first pass in docs or a presentation, to signify that they are special and distinct from dataframes or other tables.

Dataset
I like DataChain dataset and persisted datachain - they get the term used more, which solidifies it in the user's mind.

Column vs. signal
I'm not sure on this one. I need to actually work with it to understand it better. My first impression is that, unlike Multi-index, Multicolumn does not really signal anything different from the plural of column. Index carries with it a different mental image, a layer of some kind, so Multi-index implies something greater is happening.

@shcheklein shcheklein added the documentation Improvements or additions to documentation label Nov 6, 2024