diff --git a/docs/user-guide/expressions/user-defined-functions.md b/docs/user-guide/expressions/user-defined-functions.md
index 40de817c4..942225128 100644
--- a/docs/user-guide/expressions/user-defined-functions.md
+++ b/docs/user-guide/expressions/user-defined-functions.md
@@ -1,7 +1,7 @@
 # User Defined functions
 
-You should be convinced by now that polar expressions are so powerful and flexible that the need for custom python functions
-is much less needed than you might need in other libraries.
+You should be convinced by now that Polars expressions are so powerful and flexible that there is much less need for custom Python functions
+than in other libraries.
 
 Still, you need to have the power to be able to pass an expression's state to a third party library or apply your black box function
 over data in polars.
diff --git a/docs/user-guide/expressions/window.md b/docs/user-guide/expressions/window.md
index 6a294d8d7..1fc5af9db 100644
--- a/docs/user-guide/expressions/window.md
+++ b/docs/user-guide/expressions/window.md
@@ -1,7 +1,7 @@
 # Window functions
 
 Window functions are expressions with superpowers. They allow you to perform aggregations on groups in the
-`select` context. Let's get a feel of what that means. First we create a dataset. The dataset loaded in the
+`select` context. Let's get a feel for what that means. First we create a dataset. The dataset loaded in the
 snippet below contains information about pokemon:
 
 {{code_block('user-guide/expressions/window','pokemon',['read_csv'])}}
diff --git a/docs/user-guide/io/csv.md b/docs/user-guide/io/csv.md
index b648f02f0..a1c22f533 100644
--- a/docs/user-guide/io/csv.md
+++ b/docs/user-guide/io/csv.md
@@ -19,5 +19,5 @@ file and instead returns a lazy computation holder called a `LazyFrame`.
 
 {{code_block('user-guide/io/csv','scan',['scan_csv'])}}
 
-If you want to know why this is desirable, you can read more about those `Polars`
+If you want to know why this is desirable, you can read more about these `Polars`
 optimizations [here](../concepts/lazy-vs-eager.md).
diff --git a/docs/user-guide/io/multiple.md b/docs/user-guide/io/multiple.md
index 43297e80f..72a79a9d0 100644
--- a/docs/user-guide/io/multiple.md
+++ b/docs/user-guide/io/multiple.md
@@ -2,7 +2,7 @@
 
 `Polars` can deal with multiple files differently depending on your needs and memory strain.
 
-Let's create some files to give use some context:
+Let's create some files to give us some context:
 
 {{code_block('user-guide/io/multiple','create',['write_csv'])}}
diff --git a/docs/user-guide/lazy/execution.md b/docs/user-guide/lazy/execution.md
index adfb579e7..894c181ef 100644
--- a/docs/user-guide/lazy/execution.md
+++ b/docs/user-guide/lazy/execution.md
@@ -34,7 +34,7 @@ shape: (14_029, 6)
 └─────────┴───────────────────────────┴─────────────┴────────────┴───────────────┴────────────┘
 ```
 
-Above we see that from the 10 Million rows there 14,029 rows match our predicate.
+Above we see that from the 10 million rows there are 14,029 rows that match our predicate.
 
 With the default `collect` method Polars processes all of your data as one batch. This means that all the data has to
 fit into your available memory at the point of peak memory usage in your query.
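Since the `csv.md` and `execution.md` hunks above both hinge on `scan_csv` returning a `LazyFrame` and on the default `collect` materializing everything in one batch, here is a minimal sketch of that pattern for reviewers; the file name and predicate are hypothetical, while `scan_csv`, `col`, `filter`, and `collect` are the Polars calls the pages reference:

```python
import polars as pl

# scan_csv does not parse the file yet; it returns a LazyFrame,
# the lazy computation holder the csv.md page describes.
lf = pl.scan_csv("my_long_file.csv")  # hypothetical file

# Adding operations only extends the query plan; nothing has run yet.
query = lf.filter(pl.col("category") == "a")  # hypothetical predicate

# The default collect() processes the data as a single batch, so the
# data must fit in memory at the point of peak memory usage.
df = query.collect()
```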
diff --git a/docs/user-guide/lazy/query_plan.md b/docs/user-guide/lazy/query_plan.md
index 7084e94a0..6845725d7 100644
--- a/docs/user-guide/lazy/query_plan.md
+++ b/docs/user-guide/lazy/query_plan.md
@@ -11,7 +11,7 @@ We can understand both the non-optimized and optimized query plans with visualiz
 
 ### Graphviz visualization
 
-First we visualise the non-optimized plan by setting `optimized=False`.
+First we visualize the non-optimized plan by setting `optimized=False`.
 
 {{code_block('user-guide/lazy/query_plan','plan',['show_graph'])}}
 
@@ -26,7 +26,7 @@ First we visualise the non-optimized plan by setting `optimized=False`.
 --8<-- "python/user-guide/lazy/query_plan.py:createplan"
 ```
 
-The query plan visualisation should be read from bottom to top. In the visualisation:
+The query plan visualization should be read from bottom to top. In the visualization:
 
 - each box corresponds to a stage in the query plan
 - the `sigma` stands for `SELECTION` and indicates any filter conditions
@@ -55,7 +55,7 @@ The printed plan should also be read from bottom to top. This non-optimized plan
 
 ## Optimized query plan
 
-Now we visualise the optimized plan with `show_graph`.
+Now we visualize the optimized plan with `show_graph`.
 
 {{code_block('user-guide/lazy/query_plan','show',['show_graph'])}}
diff --git a/docs/user-guide/lazy/using.md b/docs/user-guide/lazy/using.md
index 87e371faf..d777557da 100644
--- a/docs/user-guide/lazy/using.md
+++ b/docs/user-guide/lazy/using.md
@@ -10,7 +10,7 @@ Here we see how to use the lazy API starting from either a file or an existing `
 
 ## Using the lazy API from a file
 
-In the ideal case, we use the lazy API right from a file as the query optimizer may help us to reduce the amount of data we read from the file.
+In the ideal case we would use the lazy API right from a file, as the query optimizer may help us to reduce the amount of data we read from the file.
 
 We create a lazy query from the Reddit CSV data and apply some transformations.
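For the `query_plan.md` hunks, a short sketch of the inspection calls may help; the query itself is invented, and `explain` and `show_graph` are the `LazyFrame` methods the page is built around (`show_graph` additionally needs Graphviz installed):

```python
import polars as pl

lf = (
    pl.scan_csv("reddit.csv")  # hypothetical input
    .filter(pl.col("comment_karma") > 0)
    .select(["name", "comment_karma"])
)

# Text form of the non-optimized plan; read it from bottom to top.
print(lf.explain(optimized=False))

# Text form of the optimized plan that collect() would actually run.
print(lf.explain())

# Graphical equivalents, as used in the docs:
# lf.show_graph(optimized=False)
# lf.show_graph()
```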
diff --git a/docs/user-guide/migration/pandas.md b/docs/user-guide/migration/pandas.md
index 5c36b0a27..79b7e353c 100644
--- a/docs/user-guide/migration/pandas.md
+++ b/docs/user-guide/migration/pandas.md
@@ -18,7 +18,7 @@ objective. We believe the semantics of a query should not change by the state of
 In Polars a DataFrame will always be a 2D table with heterogeneous data-types. The data-types may have nesting, but the
 table itself will not. Operations like resampling will be done by specialized functions or methods that act like 'verbs' on a table explicitly
-stating columns that 'verb' operates on. As such, it is our conviction that not having indices make things simpler,
+stating the columns that the 'verb' operates on. As such, it is our conviction that not having indices makes things simpler,
 more explicit, more readable and less error-prone.
 
 Note that an 'index' data structure as known in databases will be used by polars as an optimization technique.
@@ -27,7 +27,7 @@ Note that an 'index' data structure as known in databases will be used by polars
 
 ### `Polars` uses Apache Arrow arrays to represent data in memory while `Pandas` uses `Numpy` arrays
 
 `Polars` represents data in memory with Arrow arrays while `Pandas` represents data in
-memory in `Numpy` arrays. Apache Arrow is an emerging standard for in-memory columnar
+memory with `Numpy` arrays. Apache Arrow is an emerging standard for in-memory columnar
 analytics that can accelerate data load times, reduce memory usage and accelerate
 calculations.
@@ -37,21 +37,21 @@
 `Polars` exploits the strong support for concurrency in Rust to run many operations in
 parallel. While some operations in `Pandas` are multi-threaded the core of the library
-is single-threaded and an additional library such as `Dask` must be used to parallelise
+is single-threaded and an additional library such as `Dask` must be used to parallelize
 operations.
 
 ### `Polars` can lazily evaluate queries and apply query optimization
 
-Eager evaluation is where code is evaluated as soon as you run the code. Lazy evaluation
-is where running a line of code means that the underlying logic is added to a query plan
+Eager evaluation is when code is evaluated as soon as you run the code. Lazy evaluation
+is when running a line of code means that the underlying logic is added to a query plan
 rather than being evaluated.
 
 `Polars` supports eager evaluation and lazy evaluation whereas `Pandas` only supports
 eager evaluation. The lazy evaluation mode is powerful because `Polars` carries out
-automatic query optimization where it examines the query plan and looks for ways to
+automatic query optimization, in which it examines the query plan and looks for ways to
 accelerate the query or reduce memory usage.
 
-`Dask` also supports lazy evaluation where it generates a query plan. However, `Dask`
+`Dask` also supports lazy evaluation, in which it generates a query plan. However, `Dask`
 does not carry out query optimization on the query plan.
 
 ## Key syntax differences
 
@@ -65,7 +65,7 @@ polars != pandas
 
 If your `Polars` code looks like it could be `Pandas` code, it might run, but it likely
 runs slower than it should.
 
-Let's go through some typical `Pandas` code and see how we might write that in `Polars`.
+Let's go through some typical `Pandas` code and see how we might rewrite it in `Polars`.
 
 ### Selecting data
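Because the `pandas.md` hunk ends right at `### Selecting data`, a tiny side-by-side of the eager/lazy point may be useful to reviewers; the frame and column names are invented:

```python
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
pl_df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Pandas: boolean-mask indexing, evaluated eagerly on the spot.
pd_out = pd_df[pd_df["a"] > 1]

# Polars eager: the same selection as an expression passed to a verb.
pl_out = pl_df.filter(pl.col("a") > 1)

# Polars lazy: the identical expression is only added to a query plan,
# which the optimizer may rewrite before collect() executes it.
pl_lazy_out = pl_df.lazy().filter(pl.col("a") > 1).collect()
```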
diff --git a/docs/user-guide/misc/alternatives.md b/docs/user-guide/misc/alternatives.md
index 0a4c5f5f4..cc9c41e5f 100644
--- a/docs/user-guide/misc/alternatives.md
+++ b/docs/user-guide/misc/alternatives.md
@@ -6,13 +6,13 @@ These are some tools that share similar functionality to what polars does.
 
   A very versatile tool for small data. Read [10 things I hate about pandas](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)
   written by the author himself. Polars has solved all those 10 things.
-  Polars is a versatile tool for small and large data with a more predictable API, less ambiguous and stricter API.
+  Polars is a versatile tool for small and large data with a more predictable, less ambiguous, and stricter API.
 
 - Pandas the API
 
   The API of pandas was designed for in memory data. This makes it a poor fit for performant analysis on large data
   (read anything that does not fit into RAM). Any tool that tries to distribute that API will likely have a
-  suboptimal query plan compared to plans that follow from a declarative API like SQL or polars' API.
+  suboptimal query plan compared to plans that follow from a declarative API like SQL or Polars' API.
 
 - Dask
@@ -40,27 +40,27 @@ These are some tools that share similar functionality to what polars does.
 - DuckDB
 
   Polars and DuckDB have many similarities. DuckDB is focused on providing an in-process OLAP Sqlite alternative,
-  polars is focused on providing a scalable `DataFrame` interface to many languages. Those different front-ends lead to
-  different optimization strategies and different algorithm prioritization. The interop between both is zero-copy.
+  Polars is focused on providing a scalable `DataFrame` interface to many languages. Those different front-ends lead to
+  different optimization strategies and different algorithm prioritization. The interoperability between both is zero-copy.
   See more: https://duckdb.org/docs/guides/python/polars
 
 - Spark
 
   Spark is designed for distributed workloads and uses the JVM. The setup for spark is complicated and the startup-time
   is slow. On a single machine Polars has much better performance characteristics. If you need to process TB's of data
-  spark is a better choice.
+  Spark is a better choice.
 
 - CuDF
 
   GPU's and CuDF are fast!
-  However, GPU's are not readily available and expensive in production. The amount of memory available on GPU often
-  is a fraction of available RAM.
-  This (and out-of-core) processing means that polars can handle much larger data-sets.
+  However, GPUs are not readily available and are expensive in production. The amount of memory available on a GPU
+  is often a fraction of the available RAM.
+  This (and out-of-core processing) means that Polars can handle much larger data-sets.
   Next to that Polars can be close in [performance to CuDF](https://zakopilo.hatenablog.jp/entry/2023/02/04/220552).
-  CuDF doesn't optimize your query, so is not uncommon that on ETL jobs polars will be faster because it can elide
-  unneeded work and materialization's.
+  CuDF doesn't optimize your query, so it is not uncommon that on ETL jobs Polars will be faster because it can elide
+  unneeded work and materializations.
 
 - Any
 
   Polars is written in Rust. This gives it strong safety, performance and concurrency guarantees.
-  Polars is written in a modular manner. Parts of polars can be used in other query program and can be added as a library.
\ No newline at end of file
+  Polars is written in a modular manner. Parts of polars can be used in other query programs and can be added as a library.
\ No newline at end of file
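One checkable claim in the `alternatives.md` hunk is the zero-copy DuckDB interoperability; a minimal sketch, assuming the `duckdb` Python package is installed (the `.pl()` conversion comes from the DuckDB guide linked in the text):

```python
import duckdb
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# DuckDB can query the in-scope Polars DataFrame by name, reading it
# through Arrow rather than copying it.
rel = duckdb.sql("SELECT a * 2 AS doubled FROM df")

# The result converts back to a Polars DataFrame, again over Arrow.
print(rel.pl())
```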
diff --git a/docs/user-guide/misc/multiprocessing.md b/docs/user-guide/misc/multiprocessing.md
index 7ca03d3f6..4e714135e 100644
--- a/docs/user-guide/misc/multiprocessing.md
+++ b/docs/user-guide/misc/multiprocessing.md
@@ -17,13 +17,13 @@ See [the optimizations section](../lazy/optimizations.md) for more optimizations
 ## When to use multiprocessing
 
 Although Polars is multithreaded, other libraries may be single-threaded.
-When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to speed up.
+When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to gain a speed-up.
 
 ## The problem with the default multiprocessing config
 
 ### Summary
 
-The [Python multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html) lists the three methods a process pool can be created:
+The [Python multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html) lists the three methods to create a process pool:
 
 1. spawn
 1. fork
@@ -31,7 +31,7 @@ The [Python multiprocessing documentation](https://docs.python.org/3/library/mul
 The description of fork is (as of 2022-10-15):
 
 > The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
 
 > Available on Unix only. The default on Unix.
@@ -52,7 +52,7 @@ Consider the example below, which is a slightly modified example posted on the [
 
 {{code_block('user-guide/misc/multiprocess','example1',[])}}
 
 Using `fork` as the method, instead of `spawn`, will cause a dead lock.
-Please note: Polars will not even start and raise the error on multiprocessing method being set wrong, but if the check would not be there, the deadlock would exist.
+Please note: Polars will not even start; it raises an error when the multiprocessing method is set incorrectly. If that check were not there, the deadlock would occur.
 
 The fork method is equivalent to calling `os.fork()`, which is a system call as defined in [the POSIX standard](https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html):
@@ -91,7 +91,7 @@ And more importantly, it actually works in combination with multithreaded librar
 
 Fourth, `spawn` starts a new process, and therefore it requires code to be importable, in contrast to `fork`.
 In particular, this means that when using `spawn` the relevant code should not be in the global scope, such as in Jupyter notebooks or in plain scripts.
 Hence in the examples above, we define functions where we spawn within, and run those functions from a `__main__` clause.
-This is not an issue for typical projects, but in quick experimentation in notebooks it could fail.
+This is not an issue for typical projects, but during quick experimentation in notebooks it could fail.
 
 ## References
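To round off the `multiprocessing.md` hunks, a minimal sketch of the `spawn` setup the page recommends; the worker and its inputs are invented, while `multiprocessing.get_context` is the standard-library call:

```python
import multiprocessing

import polars as pl


def worker(n: int) -> int:
    # Hypothetical work done in the child process.
    df = pl.DataFrame({"a": list(range(n))})
    return df.height


def main() -> None:
    # Ask for 'spawn' explicitly instead of relying on the platform
    # default ('fork' on Unix), so the children do not inherit the
    # parent's threads and locks.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(worker, [1, 2, 3]))


# spawn re-imports this module in each child, so the entry point
# must be guarded.
if __name__ == "__main__":
    main()
```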