Thinking about improving data updates #647

Closed
monfera opened this issue Jun 15, 2016 · 6 comments

Comments

@monfera
Contributor

monfera commented Jun 15, 2016

tl;dr
There's more and more code that couples the aspect of plotting logic with the aspect of incrementally propagating changes, e.g. see everything going on in Plotly.restyle. It would be good to discuss ways to improve the situation. Manual code leads to a tangle, and a small, simple library focused on change propagation, e.g. MobX, would be worth looking into.

Plotting turns a stream of user intent into a stream of side effects such as DOM updates

Plotting can be conceived of as a black box:

  • input streams are plot specifications, typically the payloads in Plotly.plot, Plotly.restyle, Plotly.relayout, animation-inducing user calls, as well as DOM events such as window.resize and mousedown
  • output is a stream of side effecting operations, e.g. DOM mutations, WebGL API calls, and sometimes event callbacks
  • currently, some output is provided by encouraging users to read directly from internal object state but it's something to move away from, by providing a query API and/or event callbacks with meaningful data, so I'll ignore this

The use of 'stream' highlights the fact that user pointer operations, restyle/relayout, animation etc. generally make plotting a temporal process, rather than something that can be modeled as a function with some input JSON and an output SVG, even if some of the uses are as simple as this special case.

Plotting logic is a directed acyclic graph of computation nodes

We have multiple pieces of input (e.g. data[0].x) at one end and DOM-mutating calls as the output at the other. However, there's a complex calculation in the middle that can be thought of as a DAG. For example,

  • the above x vector serves as the basis for calculating a [min, max] domain that will determine the bounds of the X axis
  • the x vector is also trivially input to scatter point positions, however, a scale transform converts domain values to e.g. pixel coordinates
  • for things like the boxplot, there may be various aggregations building on top of the x vector
  • aesthetics might depend on things like how long the x vector is; maybe defaulting from scatterplot to a density plot at some threshold

All such calculations themselves can be input to downstream calculations.
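
As a rough illustration of the dependency chain listed above, here is a minimal sketch in plain JavaScript. The function names (extent, linearScale, pixelPositions) and the 400 pixel range are assumptions made for the example; they are not plotly.js internals.

```javascript
// Hypothetical node functions; none of these names come from plotly.js.

// Node: [min, max] domain of the x vector.
function extent(values) {
  let min = Infinity;
  let max = -Infinity;
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  return [min, max];
}

// Node: linear scale mapping the domain onto a pixel range.
function linearScale([d0, d1], [r0, r1]) {
  const span = d1 - d0 || 1; // avoid division by zero for a degenerate domain
  return (v) => r0 + ((v - d0) / span) * (r1 - r0);
}

// Node: pixel positions, depending on both the raw vector and the scale.
function pixelPositions(values, scale) {
  return values.map(scale);
}

const x = [2, 4, 10];
const domain = extent(x);                    // depends on x
const scale = linearScale(domain, [0, 400]); // depends on domain
const px = pixelPositions(x, scale);         // depends on x and scale; [0, 100, 400]
```

Each constant at the bottom is one node of the graph, and the arguments it receives are its incoming edges. Written this way, everything is recomputed from scratch on every change, which is exactly the cost the following sections try to avoid.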

Plotting needs to be economical

While it would be possible to make a single function whose inputs are {domRoot, userIntentHistory}, it's impractical: response times with a naive implementation would be too high (keeping userIntentHistory has only a modest impact on size). There's no way to recompute everything from scratch and expect a 60FPS frame rate when turning a WebGL plot or animating something.

This means that there needs to be some kind of caching, therefore state management. The sole purpose of maintaining state is caching (besides this, we may retain userIntentHistory to allow time travel, and of course the output streams are linked to calls that modify the DOM).

Means of reducing recomputation costs

Ideally we'd like to

  • Only recompute what's strictly needed. For example, if I add a new highest value to vector x, it needs to lead to an increased visible X axis domain, provided it's set to automatic. However, if the newly inserted value is inside the bounds, there's no need to recalculate anything that depends only on the [min, max] domain. Sure, sometimes there's no harm in recalculating, because it is fast or because speed doesn't matter, but there are cases when it's useful to be fairly granular about recalculations due to some specific performance need. Solving these specific performance needs one by one, without a formal change propagation approach, is brittle.
  • It may even be useful, necessary and easy to pick calculation algorithms that are incremental. For example, a newly arriving X value can be directly used to update the [min, max] bounds, as opposed to inserting it in the preexisting large vector and applying the vector extent calculation. Similarly, many types of aggregates can be calculated online as well as in batch, for example mean, variance and standard deviation.
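
Both points can be sketched in a few lines. The helper names below (extendDomain, createRunningStats) are hypothetical; the running mean and variance use Welford's online algorithm, picked here as one well-known incremental method rather than anything the thread prescribes.

```javascript
// Widen a [min, max] domain with one new value.
// Returns the same array when nothing changed, so callers can compare by
// reference and skip anything that depends only on the domain.
function extendDomain(domain, value) {
  const [min, max] = domain;
  if (value >= min && value <= max) return domain;
  return [Math.min(min, value), Math.max(max, value)];
}

// Welford's online algorithm: mean and variance updated one value at a time,
// without keeping or rescanning the full vector.
function createRunningStats() {
  let n = 0;
  let mean = 0;
  let m2 = 0; // running sum of squared deviations from the current mean
  const variance = () => (n > 1 ? m2 / (n - 1) : 0); // sample variance
  return {
    push(value) {
      n += 1;
      const delta = value - mean;
      mean += delta / n;
      m2 += delta * (value - mean);
    },
    count: () => n,
    mean: () => mean,
    variance,
    stdDev: () => Math.sqrt(variance()),
  };
}

const before = [2, 10];
const inside = extendDomain(before, 5);   // same array: no downstream work needed
const outside = extendDomain(before, 12); // new array [2, 12]: axis must update

const stats = createRunningStats();
[2, 4, 4, 4, 5, 5, 7, 9].forEach((v) => stats.push(v));
```

The reference check `inside === before` is the cheap signal that nothing depending only on the domain needs to run again.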

Some possible tools

Handwritten userland JavaScript isn't well suited to managing a dependency graph, because given enough nodes and optimization rounds, there will inevitably be cache invalidation issues and, potentially, memory leaks. Keeping things consistent and in sync is also a challenge, especially in the presence of asynchronous events. Most importantly, coupling the plot logic aspect with the incremental recalculation aspect makes both aspects hard to decipher, debug and further develop.

There are a lot of tools, inspired by Functional Reactive Programming, that provide some kind of framework for calculating and propagating values that change over time in response to input. Without endorsing any one of these excellent libraries (xstream, most.js etc.), perhaps MobX would feel closest to the current architecture: it gives you objects whose properties act like calculated spreadsheet cells, and it is object-oriented, as @etpinard suggested in the 2.0 wishlist. Investigation would be needed to see how it fits. All these libs are around 10k compressed.
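
To make the "calculated spreadsheet cell" idea concrete, here is a small hand-rolled sketch. It is not the MobX API and not plotly.js code: inputCell and computedCell are hypothetical names, and a real library handles much more (automatic dependency tracking, batching, disposal).

```javascript
// An input cell holds a value and knows which computed cells read it.
function inputCell(initial) {
  let value = initial;
  const dependents = new Set();
  return {
    get: () => value,
    set(next) {
      if (next === value) return; // unchanged: nothing to propagate
      value = next;
      dependents.forEach((d) => d.invalidate());
    },
    subscribe: (d) => dependents.add(d),
  };
}

// A computed cell caches its result until one of its sources changes.
function computedCell(sources, compute) {
  let cached;
  let dirty = true;
  const dependents = new Set();
  const cell = {
    get() {
      if (dirty) {
        cached = compute(...sources.map((s) => s.get()));
        dirty = false;
      }
      return cached;
    },
    invalidate() {
      if (dirty) return; // already stale, so its dependents are stale too
      dirty = true;
      dependents.forEach((d) => d.invalidate());
    },
    subscribe: (d) => dependents.add(d),
  };
  sources.forEach((s) => s.subscribe(cell));
  return cell;
}

let extentRuns = 0; // counts how often the domain is actually recomputed
const xs = inputCell([2, 4, 10]);
const width = inputCell(400);
const domain = computedCell([xs], (v) => {
  extentRuns += 1;
  return [Math.min(...v), Math.max(...v)];
});
const pixels = computedCell([xs, domain, width], (v, [d0, d1], w) =>
  v.map((p) => ((p - d0) / (d1 - d0)) * w));

pixels.get();   // computes domain, then pixels
width.set(800); // marks pixels stale; domain is untouched
pixels.get();   // domain comes from cache, so extentRuns is still 1
```

The plotting logic lives in the two compute callbacks, while caching and invalidation live in the cell helpers, which is the separation of aspects argued for above.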

History

We've touched on related topics in the past; a few inspirations:

@etpinard
Contributor

@monfera Thanks for the feedback.

I'd vote 👎 for bringing any functional reactive programming library.

Like you point out, our update system is in dire need of a refactor, but an in-house solution seems best in terms of scaling and maintenance.

We already have decent building blocks, namely nestedProperty and our attribute declaration system. It shouldn't be that hard to come up with a performant and flexible update framework for our needs.

@mdtusz
Contributor

mdtusz commented Jun 15, 2016

Beat me to it. In any case:

While I agree that we should improve our data model, I'm not sure that using something like mobx is the right choice - reactive programming is great for UI and scenarios where updates don't require immense calculation and/or are direct in their codepaths, but for our uses, I imagine we would quickly find ourselves patching things to fit our use case where sometimes the data needs to be transformed by A, then B, then A again, before we can render.

What they provide also loses value when not working with in-memory state. There is plenty of plotly.js state wrapped up in the SVG DOM, so until we separate that out, the transition would be quite rocky.

I hate to reinvent the wheel, but I'm of the opinion that what we may need is closer to a tractor tread. I'd advocate instead for creating a strict pattern for updates that works for us, and it very likely will be more of a puppetmaster pattern.

@monfera
Contributor Author

monfera commented Jun 15, 2016

Whether we expect it from our own utilities/patterns or an external library, do we have roughly similar notions in mind about the needs?

  1. allow incremental additions, removal and changes of data points or entire traces, individually or in batches, happening over time
  2. similarly, incremental or grouped changes to configuration or layout aspects
  3. aid the minimization of compute work/time to be done on updates caused by these, partly to respond to changes quickly and partly to avoid disruption, flashing, relayout of the output and retain object constancy of axes, lines and points
  4. ensure that data or rendered output does not become stale, i.e. make it hard to not channel in some dependency (if Y depends on X and X changed, Y needs to change unless explicit logic infers that no change is needed, or an express strategy such as throttling, debouncing etc. is in place to defer or batch work)
  5. make it easy to understand where values came from and what transformations they went through, if there's a bug and we need to know the point at which something suspicious got in

    plenty of plotly.js state wrapped up in SVG dom, so until we separate that out, the transition would be quite rocky

Things like the existing codebase, test case coverage, documentation, examples etc. incorporate a lot of work already spent, and lessons learnt. Which is why I'd like to learn more about Puppetmaster (is it this one?). Also, what do you mean by tractor tread in this context?

@monfera
Contributor Author

monfera commented Jun 15, 2016

@mdtusz on second thought, you more likely mean Puppet (vs Chef), the idempotence concept etc.

@mdtusz
Contributor

mdtusz commented Jun 16, 2016

I wasn't really referring to Chef/Puppet at all - those aren't really relevant here. I meant more just a pattern where some section of code is in charge of orchestrating our update operations - albeit in a cleaner and more organized way than we currently do. Perhaps using the term puppetmaster was misleading.
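
One possible reading of such an orchestrating section, sketched under heavy assumptions: a coordinator holds an ordered list of update steps, each declaring which attribute path prefixes it cares about, and runs only the affected steps in that fixed order. The step names (calc, autorange, draw) and the watched paths are invented for illustration and do not describe how plotly.js is organized.

```javascript
// Hypothetical coordinator; step names and attribute paths are made up.
function createUpdateCoordinator(steps) {
  return {
    // `changedPaths` is a list of attribute paths such as 'data[0].x'.
    update(changedPaths) {
      const ran = [];
      for (const step of steps) {
        const affected = changedPaths.some((path) =>
          step.watches.some((prefix) => path.startsWith(prefix)));
        if (affected) {
          step.run(changedPaths);
          ran.push(step.name);
        }
      }
      return ran; // handy for logging which steps an update triggered
    },
  };
}

const log = [];
const coordinator = createUpdateCoordinator([
  { name: 'calc', watches: ['data'], run: () => log.push('calc') },
  { name: 'autorange', watches: ['data', 'layout.xaxis'], run: () => log.push('autorange') },
  { name: 'draw', watches: ['data', 'layout'], run: () => log.push('draw') },
]);

coordinator.update(['layout.title']); // only the draw step is affected
```

A data change such as `coordinator.update(['data[0].x'])` would run all three steps in order, while a title change only redraws.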

@monfera
Contributor Author

monfera commented Jun 18, 2016

Closing it in favor of #648.


3 participants