Merge remote-tracking branch 'origin/main' into nb/manipulation_function_basics
JuliaData · Oct 10, 2024 · efde542 · efde542
2 parents d9864ba + 85815e4
commit efde542
Showing 16 changed files with 610 additions and 307 deletions.
diff --git a/.github/dependabot.yml b/.github/dependabot.yml
@@ -0,0 +1,9 @@
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "monthly"
+    labels:
+      - "dependencies"
+      - "no changelog"
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -22,31 +22,35 @@ jobs:
           - os: windows-latest
             version: '1'
             arch: x86
+          - os: macos-latest
+            version: '1'
+            arch: aarch64
           - os: ubuntu-latest
             version: 'nightly'
             arch: x64
             allow_failure: true
     steps:
-      - uses: actions/checkout@v2
-      - uses: julia-actions/setup-julia@v1
+      - uses: actions/checkout@v4
+      - uses: julia-actions/setup-julia@v2
         with:
           version: ${{ matrix.version }}
           arch: ${{ matrix.arch }}
-      - uses: julia-actions/cache@v1
+      - uses: julia-actions/cache@v2
       - uses: julia-actions/julia-buildpkg@v1
       - uses: julia-actions/julia-runtest@v1
         env:
           JULIA_NUM_THREADS: 4,1
       - uses: julia-actions/julia-processcoverage@v1
-      - uses: codecov/codecov-action@v1
+      - uses: codecov/codecov-action@v4
         with:
           file: lcov.info
+          token: ${{ secrets.CODECOV_TOKEN }}
   docs:
     name: Documentation
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
-      - uses: julia-actions/cache@v1
+      - uses: actions/checkout@v4
+      - uses: julia-actions/cache@v2
       - uses: julia-actions/julia-buildpkg@latest
       - uses: julia-actions/julia-docdeploy@latest
         env:

diff --git a/Project.toml b/Project.toml
@@ -18,13 +18,12 @@ PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
 PrettyTables = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"
 Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
-REPL = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"
 Reexport = "189a3867-3050-52da-a836-e630ba90ab69"
+SentinelArrays = "91c51154-3ec4-41a3-a24f-3f23e20d615c"
 SortingAlgorithms = "a2af1166-a08f-5f64-846c-94a0d3cef48c"
 Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 TableTraits = "3783bdb8-4a98-5b6b-af9a-565f29a5fe9c"
 Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
-SentinelArrays = "91c51154-3ec4-41a3-a24f-3f23e20d615c"
 Unicode = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"
 
 [compat]
@@ -46,6 +45,7 @@ Reexport = "1"
 SentinelArrays = "1.2"
 ShiftedArrays = "1, 2"
 SortingAlgorithms = "0.3, 1"
+Statistics = "1"
 TableTraits = "0.4, 1"
 Tables = "1.9.0"
 Unitful = "1"
@@ -58,12 +58,10 @@ DataValues = "e7dc6d0d-1eca-5fa6-8ad6-5aecde8b7ea5"
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
 Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
 OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"
+ShiftedArrays = "1277b4bf-5013-50f5-be3d-901d8477a67a"
 SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 Unitful = "1986cc42-f94f-5a68-af5c-568840ba703d"
-ShiftedArrays = "1277b4bf-5013-50f5-be3d-901d8477a67a"
 
 [targets]
-test = ["CategoricalArrays", "Combinatorics", "DataValues",
-        "Dates", "Logging", "OffsetArrays", "Test",
-        "Unitful", "ShiftedArrays", "SparseArrays"]
+test = ["CategoricalArrays", "Combinatorics", "DataValues", "Dates", "Logging", "OffsetArrays", "Test", "Unitful", "ShiftedArrays", "SparseArrays"]
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 DataFrames.jl
 =============
 
-[![Coverage Status](http://codecov.io/github/JuliaData/DataFrames.jl/coverage.svg?branch=main)](http://codecov.io/github/JuliaData/DataFrames.jl?branch=main)
+[![codecov](https://codecov.io/gh/JuliaData/DataFrames.jl/graph/badge.svg?token=DHYzeKcumV)](https://codecov.io/gh/JuliaData/DataFrames.jl)
 [![CI Testing](https://github.com/JuliaData/DataFrames.jl/workflows/CI/badge.svg)](https://github.com/JuliaData/DataFrames.jl/actions?query=workflow%3ACI+branch%3Amain)
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7632427.svg)](https://doi.org/10.5281/zenodo.7632427)
 

diff --git a/docs/Project.toml b/docs/Project.toml
@@ -9,6 +9,7 @@ Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28"
 Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
 Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
+TidierData = "fe2206b3-d496-4ee9-a338-6a095c4ece80"
 
 [compat]
 Documenter = "1"
diff --git a/docs/src/man/querying_frameworks.md b/docs/src/man/querying_frameworks.md
@@ -8,6 +8,145 @@ DataFramesMeta.jl, DataFrameMacros.jl and Query.jl. They implement a functionali
 These frameworks are designed both to make it easier for new users to start working with data frames in Julia
 and to allow advanced users to write more compact code.
 
+## TidierData.jl
+[TidierData.jl](https://tidierorg.github.io/TidierData.jl/latest/), part of 
+the [Tidier](https://tidierorg.github.io/Tidier.jl/dev/) ecosystem, is a macro-based 
+data analysis interface that wraps DataFrames.jl.  The instructions below are for version 
+0.16.0 of TidierData.jl.
+
+First, install the TidierData.jl package:
+
+```julia
+using Pkg
+Pkg.add("TidierData")
+```
+
+TidierData.jl enables clean, readable, and fast code for all major data transformation 
+functions including 
+[aggregating](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/summarize/), 
+[pivoting](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/pivots/), 
+[nesting](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/nesting/), 
+and [joining](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/joins/) 
+data frames. TidierData re-exports `DataFrame` from DataFrames.jl, `@chain` from Chain.jl, and 
+Statistics.jl to streamline data operations. 
+
+TidierData.jl is heavily inspired by the `dplyr` and `tidyr` R packages (part of the R 
+`tidyverse`), which it aims to implement using pure Julia by wrapping DataFrames.jl. While
+TidierData.jl borrows conventions from the `tidyverse`, it is important to note that the 
+`tidyverse` itself is often not considered idiomatic R code. TidierData.jl brings 
+data analysis conventions from `tidyverse` into Julia to have the best of both worlds: 
+tidy syntax and the speed and flexibility of the Julia language.
+
+TidierData.jl has two major differences from other macro-based packages. First, TidierData.jl 
+uses tidy expressions. An example of a tidy expression is `a = mean(b)`, where `b` refers 
+to an existing column in the data frame, and `a` refers to either a new or existing column. 
+Referring to variables outside of the data frame requires prefixing variables with `!!`. 
+For example, `a = mean(!!b)` refers to a variable `b` outside the data frame. Second, 
+TidierData.jl aims to make broadcasting mostly invisible through 
+[auto-vectorization](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/). TidierData.jl currently uses a lookup table to decide which functions not to 
+vectorize; all other functions are automatically vectorized. This allows 
+concise expressions: `@mutate(df, a = a - mean(a))` transforms the `a` column 
+by subtracting the mean of the column from each value. Behind the scenes, the right-hand 
+expression is converted to `a .- mean(a)` because `mean()` is in the lookup table as a 
+function that should not be vectorized. Take a look at the 
+[auto-vectorization](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/) documentation for details.
+
+One major benefit of combining tidy expressions with auto-vectorization is that 
+TidierData.jl code (which uses DataFrames.jl as its backend) can work directly on 
+databases using [TidierDB.jl](https://github.com/TidierOrg/TidierDB.jl), 
+which converts tidy expressions into SQL, supporting DuckDB and several other backends.
+
+```jldoctest tidierdata
+julia> using TidierData
+
+julia> df = DataFrame(
+                name = ["John", "Sally", "Roger"],
+                age = [54.0, 34.0, 79.0],
+                children = [0, 2, 4]
+            )
+3×3 DataFrame
+ Row │ name    age      children
+     │ String  Float64  Int64
+─────┼───────────────────────────
+   1 │ John       54.0         0
+   2 │ Sally      34.0         2
+   3 │ Roger      79.0         4
+
+julia> @chain df begin
+           @filter(children != 2)
+           @select(name, num_children = children)
+       end
+2×2 DataFrame
+ Row │ name    num_children 
+     │ String  Int64        
+─────┼──────────────────────
+   1 │ John               0
+   2 │ Roger              4
+```
+
+Below are examples showcasing `@group_by` with `@summarize` or `@mutate` - analogous to the split, apply, combine pattern.
+
+```jldoctest tidierdata
+julia> df = DataFrame(
+                groups = repeat('a':'e', inner = 2), 
+                b_col = 1:10, 
+                c_col = 11:20, 
+                d_col = 111:120
+            )
+10×4 DataFrame
+ Row │ groups  b_col  c_col  d_col 
+     │ Char    Int64  Int64  Int64 
+─────┼─────────────────────────────
+   1 │ a           1     11    111
+   2 │ a           2     12    112
+   3 │ b           3     13    113
+   4 │ b           4     14    114
+   5 │ c           5     15    115
+   6 │ c           6     16    116
+   7 │ d           7     17    117
+   8 │ d           8     18    118
+   9 │ e           9     19    119
+  10 │ e          10     20    120
+
+julia> @chain df begin
+           @filter(b_col > 2)
+           @group_by(groups)
+           @summarise(median_b = median(b_col), 
+                      across((b_col:d_col), mean))   
+       end
+4×5 DataFrame
+ Row │ groups  median_b  b_col_mean  c_col_mean  d_col_mean 
+     │ Char    Float64   Float64     Float64     Float64    
+─────┼──────────────────────────────────────────────────────
+   1 │ b            3.5         3.5        13.5       113.5
+   2 │ c            5.5         5.5        15.5       115.5
+   3 │ d            7.5         7.5        17.5       117.5
+   4 │ e            9.5         9.5        19.5       119.5
+
+julia> @chain df begin
+           @filter(b_col > 4 && c_col <= 18)
+           @group_by(groups)
+           @mutate(
+               new_col = b_col + maximum(d_col),
+               new_col2 = c_col - maximum(d_col),
+               new_col3 = case_when(c_col >= 18  => "high",
+                                    c_col > 15   => "medium",
+                                    true         => "low"))
+           @select(starts_with("new"))
+           @ungroup # required because `@mutate` does not ungroup
+       end
+4×4 DataFrame
+ Row │ groups  new_col  new_col2  new_col3 
+     │ Char    Int64    Int64     String   
+─────┼─────────────────────────────────────
+   1 │ c           121      -101  low
+   2 │ c           122      -100  medium
+   3 │ d           125      -101  medium
+   4 │ d           126      -100  high
+```
+
+For more examples, please visit the [TidierData.jl](https://tidierorg.github.io/TidierData.jl/latest/) documentation.
+
 ## DataFramesMeta.jl
 
 The [DataFramesMeta.jl](https://github.com/JuliaStats/DataFramesMeta.jl) package

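For comparison with the auto-vectorization paragraph in the TidierData.jl section above, the broadcast form `a .- mean(a)` can be written out by hand in plain DataFrames.jl. This is a minimal sketch, not TidierData.jl's actual macro expansion; the sample data and the name `centered` are made up for illustration.

```julia
using DataFrames, Statistics

# Hypothetical sample data, not taken from the documentation above.
df = DataFrame(a = [1.0, 2.0, 3.0])

# mean(a) reduces the whole column to a single number; `.-` then broadcasts
# the subtraction over every element of the column.
centered = transform(df, :a => (a -> a .- mean(a)) => :a)
```

Here the dot has to be placed manually, which is the step the lookup table described above is meant to automate.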
diff --git a/docs/src/man/working_with_dataframes.md b/docs/src/man/working_with_dataframes.md
@@ -812,14 +812,21 @@ julia> df = DataFrame(A=1:4, B=4.0:-1.0:1.0)
    3 │     3      2.0
    4 │     4      1.0
 
-julia> combine(df, names(df) .=> sum)
+julia> combine(df, All() .=> sum)
 1×2 DataFrame
  Row │ A_sum  B_sum
      │ Int64  Float64
 ─────┼────────────────
    1 │    10     10.0
 
-julia> combine(df, names(df) .=> sum, names(df) .=> prod)
+julia> combine(df, All() .=> sum, All() .=> prod)
+1×4 DataFrame
+ Row │ A_sum  B_sum    A_prod  B_prod
+     │ Int64  Float64  Int64   Float64
+─────┼─────────────────────────────────
+   1 │    10     10.0      24     24.0
+
+julia> combine(df, All() .=> [sum prod]) # the same using 2-dimensional broadcasting
 1×4 DataFrame
  Row │ A_sum  B_sum    A_prod  B_prod
      │ Int64  Float64  Int64   Float64
@@ -830,6 +837,90 @@ julia> combine(df, names(df) .=> sum, names(df) .=> prod)
 If you would prefer the result to have the same number of rows as the source
 data frame, use `select` instead of `combine`.
 
+In the remainder of this section we will discuss more advanced topics related
+to the operation specification syntax, so you may decide to skip them if you
+want to focus on the most common usage patterns.
+
+A `DataFrame` can store values of any type as its columns, for example
+below we show how one can store a `Tuple`:
+
+```
+julia> df2 = combine(df, All() .=> extrema)
+1×2 DataFrame
+ Row │ A_extrema  B_extrema
+     │ Tuple…     Tuple…
+─────┼───────────────────────
+   1 │ (1, 4)     (1.0, 4.0)
+```
+
+Later you might want to expand the tuples into separate columns storing the computed
+minima and maxima. This can be achieved by passing multiple columns for the output.
+Here is an example of how this can be done by writing the column names by hand for a single
+input column:
+
+```
+julia> combine(df2, "A_extrema" => identity => ["A_min", "A_max"])
+1×2 DataFrame
+ Row │ A_min  A_max
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1      4
+```
+
+You can extend it to handling all columns in `df2` using broadcasting:
+
+```
+julia> combine(df2, All() .=> identity .=> [["A_min", "A_max"], ["B_min", "B_max"]])
+1×4 DataFrame
+ Row │ A_min  A_max  B_min    B_max
+     │ Int64  Int64  Float64  Float64
+─────┼────────────────────────────────
+   1 │     1      4      1.0      4.0
+```
+
+This approach works, but can be improved. Instead of writing all the column names
+manually, we can use a function to specify target column names
+based on source column names:
+
+```
+julia> combine(df2, All() .=> identity .=> c -> first(c) .* ["_min", "_max"])
+1×4 DataFrame
+ Row │ A_min  A_max  B_min    B_max
+     │ Int64  Int64  Float64  Float64
+─────┼────────────────────────────────
+   1 │     1      4      1.0      4.0
+```
+
+Note that in this example we needed to pass `identity` explicitly since with
+`All() => (c -> first(c) .* ["_min", "_max"])` the right-hand side part would be
+treated as a transformation and not as a rule for generating target column names.
+
+You might want to perform the transformation of the source data frame into the result
+we have just shown in one step. This can be achieved with the following expression:
+
+```
+julia> combine(df, All() .=> Ref∘extrema .=> c -> c .* ["_min", "_max"])
+1×4 DataFrame
+ Row │ A_min  A_max  B_min    B_max
+     │ Int64  Int64  Float64  Float64
+─────┼────────────────────────────────
+   1 │     1      4      1.0      4.0
+```
+
+Note that in this case we needed to add a `Ref` call in the `Ref∘extrema` operation specification.
+Without `Ref`, `combine` iterates the contents of the value returned by the operation specification function,
+which in our case is a tuple of numbers, and tries to expand it assuming that each produced value represents one row,
+so one gets an error:
+
+```
+julia> combine(df, All() .=> extrema .=> [c -> c .* ["_min", "_max"]])
+ERROR: ArgumentError: 'Tuple{Int64, Int64}' iterates 'Int64' values,
+which doesn't satisfy the Tables.jl `AbstractRow` interface
+```
+
+Note that we used `Ref` as it is a container that is typically used in DataFrames.jl when one
+wants to store one row; in general, however, it could be another iterator (e.g. a tuple).
+
 ## Handling of Columns Stored in a `DataFrame`
 
 Functions that transform a `DataFrame` to produce a

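The `Ref` discussion in the hunk above can be reduced to a single-column sketch. It assumes the same `df` as in that section and writes the target names by hand; the names `via_ref` and `via_vector` are illustrative only. Wrapping the tuple in a one-element vector is expected to behave the same way, since it produces the same kind of value as the `identity` case on `df2`.

```julia
using DataFrames

df = DataFrame(A=1:4, B=4.0:-1.0:1.0)

# Ref wraps the (min, max) tuple so that it is treated as a single row
# and spread over the two target columns.
via_ref = combine(df, :A => Ref∘extrema => ["A_min", "A_max"])

# A one-element vector holding the tuple also iterates exactly one row.
via_vector = combine(df, :A => (x -> [extrema(x)]) => ["A_min", "A_max"])
```

Both forms give the operation specification an iterator that yields exactly one row, which is the point the text makes about `Ref` not being the only possible container.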
diff --git a/src/DataFrames.jl b/src/DataFrames.jl
@@ -1,6 +1,6 @@
 module DataFrames
 
-using Statistics, Printf, REPL
+using Statistics, Printf
 using Reexport, SortingAlgorithms, Compat, Unicode, PooledArrays
 @reexport using Missings, InvertedIndices
 using Base.Sort, Base.Order, Base.Iterators, Base.Threads