docs: MVP plotly-express docs (#554)

Minimum required docs for the plotly-express plugin. Here are the outstanding items:

1. Fill out "other".
2. Document `ecdf` once it is implemented.

Directions for testing:

As of 7/17, everything needed for testing is baked into a release. Here's a simple testing environment using pip-installed DH.

```
# make new dir for testing
mkdir test-dx && cd test-dx

# create env for installs
python -m venv test-dx-venv
source test-dx-venv/bin/activate

# install some necessary things for the build
pip install --upgrade pip setuptools

# install the server, need 35.1 or 34.3
pip install deephaven-server==0.35.1

# install the plugin
pip install deephaven-plugin-plotly-express

# I need to do this to get `which deephaven` to give the correct venv version, you may not
deactivate
source test-dx-venv/bin/activate

# start the server
deephaven server
```

---------

Co-authored-by: margaretkennedy <[email protected]>
deephaven · Jul 29, 2024
commit 4c556d3 · 1 parent def7069

Showing 33 changed files with 1,209 additions and 898 deletions.
diff --git a/.gitignore b/.gitignore
@@ -35,3 +35,6 @@ playwright/.cache/
 # virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
 hs_err_pid*
 replay_pid*
+
+plugins-venv/
+plugins-dev-venv/
diff --git a/plugins/plotly-express/docs/README.md b/plugins/plotly-express/docs/README.md
@@ -117,6 +117,23 @@ my_plot = dx.line(table=my_table, x="Timestamp", y="Price", color="Sym")
 
 In this example, we create a Deephaven table and create a line plot of `Timestamp` against `Price` with automatic downsampling. A trace is created for each value in the `Sym` column, each of which has a unique color.
 
+## Documentation Terminology
+
+The documentation for Deephaven Express routinely uses some common terms to help clarify how plots are intended to be used:
+
+- **Variable**: Variables, usually represented as columns in a Deephaven table, are a series of data points or observations of a particular characteristic in the data set. Examples include age, GDP, stock price, wind direction, sex, zip code, shoe size, square footage, and height.
+
+The following terms define different types of variable. Variable types are important because any given plot is usually only intended to be used with a specific variable type:
+
+- **Categorical variable**: This is a variable with a countable (often small) number of possible measurements for which an average cannot be computed. Examples include sex, country, flower species, stock symbol, and last name. Zip code is also a categorical variable, because while it is made of numbers and can technically be averaged, the "average zip code" is not a sensible concept.
+- **Discrete numeric variable** (often abbreviated to _discrete variable_): This is a variable with a countable number of possible measurements for which an average can be computed. These are typically represented with whole numbers. Examples include the number of wins in a season, number of bedrooms in a house, the size of one's immediate family, and the number of letters in a word.
+- **Continuous numeric variable** (often abbreviated to _continuous variable_): This is a variable with a spectrum of possible measurements for which an average can be computed. These are typically represented with decimal or fractional numbers. Examples include height, square footage of a home, length of a flower petal, price of a stock, and the distance between two stars.
+
+The following terms define relationships between variables. They do not describe attributes of a variable, but describe how a variable relates to others:
+
+- **Explanatory variable**: A variable that other variables depend on in some important way. The most common example is time. If explanatory variables are displayed in a plot, they are presented on the x-axis by convention.
+- **Response variable**: A variable that depends directly on another variable (the explanatory variable) in some important way. A rule of thumb is that explanatory variables are used to make predictions about response variables, but not conversely. If response variables are displayed in a plot, they are presented on the y-axis by convention.
+
 ## Contributing
 
 We welcome contributions to Deephaven Plotly Express! If you encounter any issues, have ideas for improvements, or would like to add new features, please open an issue or submit a pull request on the [GitHub repository](https://github.com/deephaven/deephaven-plugins).

diff --git a/plugins/plotly-express/docs/area.md b/plugins/plotly-express/docs/area.md
@@ -1,16 +1,45 @@
 # Area Plot
 
-An area plot, also known as a stacked area chart, is a data visualization that uses multiple filled areas stacked on top of one another to represent the cumulative contribution of distinct categories over a continuous interval or time period. This makes it valuable for illustrating the composition and trends within data, especially when comparing the distribution of different categories.
+An area plot, also known as a stacked area chart, is a data visualization that uses multiple filled areas stacked on top of one another to represent the cumulative contribution of distinct categories over a continuous interval or time period. Area plots always start the y-axis at zero, because the height of each line at any point is exactly equal to its contribution to the whole, and the proportion of each category's contribution must be represented faithfully.
 
-Area plots are useful for:
+Area plots are appropriate when the data contain a continuous response variable that directly depends on a continuous explanatory variable, such as time. Further, the response variable can be broken down into contributions from each of several independent categories, and those categories are represented by an additional categorical variable. 
 
-1. **Comparing Category Trends**: Use area plots to compare and track trends in different categories over time, providing a clear view of their cumulative contributions.
-2. **Proportional Representation**: When you need to show the relative proportion of different categories within a dataset, area plots offer an effective means of visualizing this information.
-3. **Data Composition**: Area plots are ideal for revealing the composition and distribution of data categories, making them useful in scenarios where the relative makeup of categories is crucial.
-4. **Time Series Analysis**: For time-dependent data, area plots are valuable for displaying changes in categorical contributions and overall trends over time.
+### What are area plots useful for?
+
+- **Visualizing trends over time**: Area plots are great for displaying the trend of a single continuous variable. The filled areas can make it easier to see the magnitude of changes and trends compared to line plots.
+- **Displaying cumulative totals**: Area plots are effective in showing cumulative totals over a period. They can help in understanding the contribution of different categories to the total amount and how these contributions evolve.
+- **Comparing multiple categories**: Rather than providing a single snapshot of the composition of a total, area plots show how contributions from each category change over time.
 
 ## Examples
 
+### A basic area plot
+
+Visualize the relationship between two variables by passing each column name to the `x` and `y` arguments.
+
+```python order=area_plot,usa_population
+import deephaven.plot.express as dx
+gapminder = dx.data.gapminder()
+
+# subset to get a specific group
+usa_population = gapminder.where("Country == `United States`")
+
+area_plot = dx.area(usa_population, x="Year", y="Pop")
+```
+
+### Area by group
+
+Area plots are unique in that the y-axis demonstrates each group's total contribution to the whole. Pass the name of the grouping column(s) to the `by` argument.
+
+```python order=area_plot_group,large_countries_population
+import deephaven.plot.express as dx
+gapminder = dx.data.gapminder()
+
+# subset to get several countries to compare
+large_countries_population = gapminder.where("Country in `United States`, `India`, `China`")
+
+# cumulative trend showing contribution from each group
+area_plot_group = dx.area(large_countries_population, x="Year", y="Pop", by="Country")
+```
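
The stacking idea described above can be sketched in plain Python, independent of Deephaven. This is only a conceptual illustration under stated assumptions: the group names and population figures are invented, and `stack` is a hypothetical helper, not part of `deephaven.plot.express`. It shows that each band's height equals that group's own value, so the top boundary of the last band equals the overall total.

```python
# Hypothetical population figures (in millions), invented for illustration only.
years = [2000, 2010, 2020]
population = {
    "A": [10.0, 12.0, 15.0],
    "B": [20.0, 21.0, 22.0],
    "C": [5.0, 7.0, 9.0],
}


def stack(series_by_group):
    """Return the upper boundary of each group's band when groups are stacked in order."""
    boundaries = {}
    running = None
    for group, values in series_by_group.items():
        if running is None:
            # The first band sits directly on the zero baseline.
            running = list(values)
        else:
            # Every later band sits on top of the bands below it.
            running = [below + value for below, value in zip(running, values)]
        boundaries[group] = running
    return boundaries


stacked = stack(population)
totals = [sum(column) for column in zip(*population.values())]

for group, upper in stacked.items():
    print(group, upper)
print("total", totals)
```

Because the lowest band starts at zero, the vertical distance between two adjacent boundaries is exactly one group's contribution, which is why truncating the y-axis would misrepresent the proportions.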
 
 ## API Reference
 ```{eval-rst}

diff --git a/plugins/plotly-express/docs/bar.md b/plugins/plotly-express/docs/bar.md
@@ -1,18 +1,55 @@
 # Bar Plot
 
-A bar plot is a graphical representation of data that uses rectangular bars to display the values of different categories or groups, making it easy to compare and visualize the distribution of data.
+A bar plot is a graphical representation of data that uses rectangular bars to display the values of different categories or groups. Bar plots aggregate the response variable across the entire dataset for each category, so that the y-axis represents the sum of the response variable per category.
 
-Advantages of bar plots include:
+Bar plots are appropriate when the data contain a continuous response variable that is directly related to a categorical explanatory variable. Additionally, if the response variable is a cumulative total of contributions from different subcategories, each bar can be broken up to demonstrate those contributions.
 
-1. **Comparative Clarity**: Bar plots are highly effective for comparing data across different categories or groups. They provide a clear visual representation of relative differences and make it easy to identify trends within the dataset.
-2. **Categorical Representation**: Bar plots excel at representing categorical data, such as survey responses, product sales by region, or user preferences. Each category is presented as a distinct bar, simplifying the visualization of categorical information.
-3. **Ease of Use**: Bar plots are user-friendly and quick to generate, making them a practical choice for various applications.
-4. **Data Aggregation**: Bar plots allow for easy aggregation of data within categories, simplifying the visualization of complex datasets, and aiding in summarizing and comparing information efficiently.
+### What are bar plots useful for?
 
-Bar plots have limitations and are not suitable for certain scenarios. They are not ideal for continuous data, ineffective for multi-dimensional data exceeding two dimensions, and unsuitable for time-series data trends. Additionally, they become less practical with extremely sparse datasets and are inadequate for representing complex interactions or correlations among multiple variables.
+- **Comparing categorical data**: Bar plots are ideal for comparing the quantities or frequencies of different categories. The height of each bar represents the value of each category, making it easy to compare them at a glance.
+- **Decomposing data by category**: When the data belong to several independent categories, bar plots make it easy to visualize the relative contributions of each category to the overall total. The bar segments are colored by category, making it easy to identify the contribution of each.
+- **Tracking trends**: If the categorical explanatory variable can be ordered left-to-right (like day of week), then bar plots provide a visualization of how the response variable changes as the explanatory variable evolves.
 
 ## Examples
 
+### A basic bar plot
+
+Visualize the relationship between a continuous variable and a categorical or discrete variable by passing the column names to the `x` and `y` arguments.
+
+```python order=bar_plot,tips
+import deephaven.plot.express as dx
+tips = dx.data.tips()
+
+bar_plot = dx.bar(tips, x="Day", y="TotalBill")
+```
+
+Change the x-axis ordering by sorting the dataset by the categorical variable.
+
+```python order=ordered_bar_plot,tips
+import deephaven.plot.express as dx
+tips = dx.data.tips()
+
+# sort the dataset to get a specific x-axis ordering; sort() orders the values alphabetically
+ordered_bar_plot = dx.bar(tips.sort("Day"), x="Day", y="TotalBill")
+```
+
+### Partition bars by group
+
+Break bars down by group by passing the name of the grouping column(s) to the `by` argument.
+
+```python order=bar_plot_smoke,bar_plot_sex,tips
+import deephaven.plot.express as dx
+tips = dx.data.tips()
+
+sorted_tips = tips.sort("Day")
+
+# group by smoker / non-smoker
+bar_plot_smoke = dx.bar(sorted_tips, x="Day", y="TotalBill", by="Smoker")
+
+# group by male / female
+bar_plot_sex = dx.bar(sorted_tips, x="Day", y="TotalBill", by="Sex")
+```
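
The per-category aggregation described above can be sketched in plain Python. This is a conceptual illustration only: the rows are invented, and `sum_by_category` is a hypothetical helper rather than anything provided by `deephaven.plot.express`. It shows how each bar's height corresponds to the sum of the response variable within one category, and why an alphabetical sort does not produce weekday order.

```python
from collections import defaultdict

# Hypothetical (day, total bill) rows, invented for illustration only.
rows = [
    ("Thur", 10.0),
    ("Fri", 20.0),
    ("Thur", 5.5),
    ("Sat", 30.0),
    ("Fri", 2.5),
]


def sum_by_category(records):
    """Sum the response value for each category; each sum is one bar's height."""
    totals = defaultdict(float)
    for category, value in records:
        totals[category] += value
    return dict(totals)


bar_heights = sum_by_category(rows)

# Sorting category names as strings gives alphabetical order, not calendar order.
alphabetical_order = sorted(bar_heights)

print(bar_heights)
print(alphabetical_order)
```

If a calendar ordering is needed, one option is to sort on a separate numeric day-of-week column instead of the day name; that column is an assumption and is not part of the `tips` examples above.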
+
 ## API Reference
 ```{eval-rst}
 .. dhautofunction:: deephaven.plot.express.bar

diff --git a/plugins/plotly-express/docs/box.md b/plugins/plotly-express/docs/box.md
@@ -1,16 +1,45 @@
 # Box Plot
 
-A box plot, also known as a box-and-whisker plot, is a data visualization that presents a summary of a dataset's distribution. It displays key statistics such as the median, quartiles, and potential outliers, making it a useful tool for visually representing the central tendency and variability of data.
+A box plot, also known as a box-and-whisker plot, is a data visualization that presents a summary of a dataset's distribution. It displays key statistics such as the median, quartiles, and potential outliers, making it a useful tool for visually representing the central tendency and variability of data. To learn more about the mathematics involved in creating box plots, check out [this article](https://asq.org/quality-resources/box-whisker-plot).
 
-Box plots are useful for:
+Box plots are appropriate when the data have a continuous variable of interest. If there is an additional categorical variable that the variable of interest depends on, side-by-side box plots may be appropriate using the `by` argument.
 
-1. **Visualizing Spread and Center**: Box plots provide a clear representation of the spread and central tendency of data, making it easy to understand the distribution's characteristics.
-2. **Identification of Outliers**: They are effective in identifying outliers within a dataset, helping to pinpoint data points that deviate significantly from the norm.
-3. **Comparative Analysis**: Box plots allow for easy visual comparison of multiple datasets or categories, making them useful for assessing variations and trends in data.
-4. **Robustness**: Box plots are robust to extreme values and data skewness, providing a reliable means of visualizing data distributions even in the presence of outliers or non-normal data.
+### What are box plots useful for?
+
+- **Visualizing overall distribution**: Box plots reveal the distribution of the variable of interest. They are good first-line tools for assessing whether a variable's distribution is symmetric, right-skewed, or left-skewed.
+- **Assessing center and spread**: A box plot displays the center (median) of a dataset using the middle line, and displays the spread (IQR) using the width of the box.
+- **Identifying potential outliers**: The dots displayed in a box plot are considered candidates for being outliers. These should be examined closely, and their frequency can help determine whether the data come from a heavy-tailed distribution.
 
 ## Examples
 
+### A basic box plot
+
+Visualize the distribution of a single variable by passing the column name to `x` or `y`.
+
+```python order=box_plot_x,box_plot_y,tips
+import deephaven.plot.express as dx
+tips = dx.data.tips()
+
+# control the plot orientation using `x` or `y`
+box_plot_x = dx.box(tips, x="TotalBill")
+box_plot_y = dx.box(tips, y="TotalBill")
+```
+
+### Distributions for multiple groups
+
+Box plots are useful for comparing the distributions of two or more groups of data. Pass the name of the grouping column(s) to the `by` argument.
+
+```python order=box_plot_group_1,box_plot_group_2,tips
+import deephaven.plot.express as dx
+tips = dx.data.tips()
+
+# total bill distribution by smoker / non-smoker
+box_plot_group_1 = dx.box(tips, y="TotalBill", by="Smoker")
+
+# total bill distribution by male / female
+box_plot_group_2 = dx.box(tips, y="TotalBill", by="Sex")
+```
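
The summary statistics that a box plot displays can be computed by hand with the standard library. The sketch below is illustrative only: the data are invented, `box_summary` is a hypothetical helper, and the 1.5 × IQR rule for flagging outlier candidates is a common convention assumed here, not a confirmed description of what `dx.box` computes. Quartile conventions also vary between tools, so results may differ slightly from a rendered plot.

```python
import statistics

# Hypothetical bill amounts, invented for illustration only.
bills = [1, 2, 3, 4, 5, 6, 7, 8, 30]


def box_summary(values):
    """Summarize a sample the way a box plot does: quartiles, IQR, and outlier candidates."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outlier candidates.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


summary = box_summary(bills)
print(summary)
```

Here the median sits in the middle of the box, so the bulk of the data looks symmetric, while the single large value is flagged as an outlier candidate on the high side.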
+
 ## API Reference
 ```{eval-rst}
 .. dhautofunction:: deephaven.plot.express.box

diff --git a/plugins/plotly-express/docs/candlestick.md b/plugins/plotly-express/docs/candlestick.md
@@ -6,15 +6,46 @@ Interpreting a candlestick chart involves understanding the visual representatio
 
 In a bullish (upward, typically shown as green) candlestick, the open is typically at the bottom of the body, and the close is at the top, indicating a price increase. In a bearish (downward, typically shown as red) candlestick, the open is at the top of the body, and the close is at the bottom, suggesting a price decrease. One can use these patterns, along with the length of the wicks and the context of adjacent candlesticks, to analyze trends.
 
-Candlestick plots are useful for:
+### What are candlestick plots useful for?
 
-1. **Analyzing Financial Markets**: They are a standard tool in technical analysis for understanding price movements, identifying trends, and potential reversal points in financial markets, such as stocks, forex, and cryptocurrencies.
-2. **Short to Medium-Term Trading**: Candlestick patterns are well-suited for short to medium-term trading strategies, where timely decisions are based on price patterns and trends over a specific time frame.
-3. **Pattern Recognition**: They aid in recognizing and interpreting common candlestick patterns, which can provide insights into market sentiment and potential price movements.
-4. **Visualizing Variation in Price Data**: Candlestick charts offer a visually intuitive way to represent variability in price data, making them valuable for traders and analysts who prefer a visual approach to data analysis.
+- **Analyzing financial markets**: Candlestick plots are a standard tool in technical analysis for understanding price movements and for identifying trends and potential reversal points in financial instruments, such as stocks, forex, and cryptocurrencies.
+- **Short to medium-term trading**: Candlestick patterns are well-suited for short to medium-term trading strategies, where timely decisions are based on price patterns and trends over a specific time frame.
+- **Visualizing variation in price data**: Candlestick plots offer a visually intuitive way to represent variability in price data, making them valuable for traders and analysts who prefer a visual approach to data analysis.
 
 ## Examples
 
+### A basic candlestick plot
+
+Visualize the key summary statistics of a stock price as it evolves. Pass the name of the time column to `x`, and pass the names of the corresponding price columns to the `open`, `high`, `low`, and `close` arguments.
+
+```python order=candlestick_plot,stocks_1min_ohlc,stocks
+import deephaven.plot.express as dx
+import deephaven.agg as agg
+stocks = dx.data.stocks()
+
+# compute ohlc per symbol for each minute
+stocks_1min_ohlc = stocks.update_view(
+    "BinnedTimestamp = lowerBin(Timestamp, 'PT1m')"
+).agg_by(
+    [
+        agg.first("Open=Price"),
+        agg.max_("High=Price"),
+        agg.min_("Low=Price"),
+        agg.last("Close=Price"),
+    ],
+    by=["Sym", "BinnedTimestamp"],
+)
+
+candlestick_plot = dx.candlestick(
+    stocks_1min_ohlc.where("Sym == `DOG`"),
+    x="BinnedTimestamp",
+    open="Open",
+    high="High",
+    low="Low",
+    close="Close",
+)
+```
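
To make the aggregation in the example above concrete, here is a plain-Python sketch of the same idea: bin ticks into fixed time windows, then take the first, maximum, minimum, and last price in each window. The tick data are invented, and `ohlc_by_bin` and `direction` are hypothetical helpers for illustration; they are not Deephaven functions.

```python
# Hypothetical (seconds since start, price) ticks in time order, invented for illustration.
ticks = [(0, 10.0), (20, 12.0), (59, 11.0), (60, 11.0), (90, 9.0), (119, 9.5)]


def ohlc_by_bin(records, bin_seconds=60):
    """Group time-ordered ticks into fixed-width bins and compute open/high/low/close."""
    bars = {}
    for timestamp, price in records:
        # Round the timestamp down to the start of its bin.
        bin_start = timestamp - timestamp % bin_seconds
        if bin_start not in bars:
            bars[bin_start] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar = bars[bin_start]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # the last tick seen in the bin
    return bars


def direction(bar):
    """Classify a candlestick by comparing its close with its open."""
    if bar["close"] > bar["open"]:
        return "bullish"
    if bar["close"] < bar["open"]:
        return "bearish"
    return "flat"


bars = ohlc_by_bin(ticks)
for bin_start, bar in bars.items():
    print(bin_start, bar, direction(bar))
```

The first bin closes above its open and would typically be drawn as a bullish candle; the second closes below its open and would be drawn as a bearish one.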
+
 ## API Reference
 ```{eval-rst}
 .. dhautofunction:: deephaven.plot.express.candlestick