This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

clarity and couple typo fixes (#356)
Co-authored-by: chielP <[email protected]>
zm711 and c-peters authored Jul 5, 2023
1 parent 0ccaa84 commit 821dc0c
Showing 10 changed files with 34 additions and 34 deletions.
4 changes: 2 additions & 2 deletions docs/user-guide/expressions/user-defined-functions.md
@@ -1,7 +1,7 @@
# User Defined functions

-You should be convinced by now that polar expressions are so powerful and flexible that the need for custom python functions
-is much less needed than you might need in other libraries.
+You should be convinced by now that polar expressions are so powerful and flexible that there is much less need for custom python functions
+than in other libraries.

Still, you need to have the power to be able to pass an expression's state to a third party library or apply your black box function
over data in polars.
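
For context (not part of this diff): a minimal sketch of passing a column to a third-party "black box" function, assuming a Polars version that provides `Expr.map_batches` (releases contemporary with this commit expose the same hook as `Expr.map`); the column name and the NumPy function are illustrative only.

```python
import numpy as np
import polars as pl

df = pl.DataFrame({"values": [1.0, 2.0, 3.0]})

# Hand the whole Series to an external (NumPy) function in a single call.
out = df.select(
    pl.col("values")
    .map_batches(lambda s: np.log1p(s.to_numpy()))
    .alias("log1p_values")
)
print(out)
```
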
2 changes: 1 addition & 1 deletion docs/user-guide/expressions/window.md
@@ -1,7 +1,7 @@
# Window functions

Window functions are expressions with superpowers. They allow you to perform aggregations on groups in the
-`select` context. Let's get a feel of what that means. First we create a dataset. The dataset loaded in the
+`select` context. Let's get a feel for what that means. First we create a dataset. The dataset loaded in the
snippet below contains information about pokemon:

{{code_block('user-guide/expressions/window','pokemon',['read_csv'])}}
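
For context (not part of this diff): a small self-contained sketch of a window function with `.over()`; the pokemon-style column names are made up to mimic the dataset in the guide.

```python
import polars as pl

df = pl.DataFrame(
    {
        "Name": ["Bulbasaur", "Ivysaur", "Charmander", "Charmeleon"],
        "Type 1": ["Grass", "Grass", "Fire", "Fire"],
        "Speed": [45, 60, 65, 80],
    }
)

# A window function: aggregate per group, but keep one value per original row.
out = df.select(
    pl.col("Name"),
    pl.col("Type 1"),
    pl.col("Speed"),
    pl.col("Speed").mean().over("Type 1").alias("mean_speed_in_type"),
)
print(out)
```
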
2 changes: 1 addition & 1 deletion docs/user-guide/io/csv.md
@@ -19,5 +19,5 @@ file and instead returns a lazy computation holder called a `LazyFrame`.

{{code_block('user-guide/io/csv','scan',['scan_csv'])}}

-If you want to know why this is desirable, you can read more about those `Polars`
+If you want to know why this is desirable, you can read more about these `Polars`
optimizations [here](../concepts/lazy-vs-eager.md).
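
For context (not part of this diff): a hedged sketch of the eager/lazy distinction for CSV reading; the file path and column names are hypothetical.

```python
import polars as pl

# Eager: the whole file is parsed into a DataFrame immediately.
df = pl.read_csv("path.csv")

# Lazy: scan_csv returns a LazyFrame; nothing is read until .collect(),
# so the filter and projection below can be pushed into the scan.
lazy_df = (
    pl.scan_csv("path.csv")
    .filter(pl.col("id") > 10)
    .select(["id", "value"])
)
df_filtered = lazy_df.collect()
```
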
2 changes: 1 addition & 1 deletion docs/user-guide/io/multiple.md
@@ -2,7 +2,7 @@

`Polars` can deal with multiple files differently depending on your needs and memory strain.

-Let's create some files to give use some context:
+Let's create some files to give us some context:

{{code_block('user-guide/io/multiple','create',['write_csv'])}}

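
For context (not part of this diff): a sketch of the setup the page describes, assuming a Polars version where `read_csv`/`scan_csv` accept glob patterns. The file names are illustrative.

```python
import polars as pl

# Create a few small CSV files to work with.
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})
for i in range(3):
    df.write_csv(f"my_many_files_{i}.csv")

# Read them all back into a single DataFrame with a glob pattern.
all_files = pl.read_csv("my_many_files_*.csv")

# Or keep it lazy, letting the optimizer plan how each file is scanned.
all_files_lazy = pl.scan_csv("my_many_files_*.csv").collect()
```
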
2 changes: 1 addition & 1 deletion docs/user-guide/lazy/execution.md
@@ -34,7 +34,7 @@ shape: (14_029, 6)
└─────────┴───────────────────────────┴─────────────┴────────────┴───────────────┴────────────┘
```

-Above we see that from the 10 Million rows there 14,029 rows match our predicate.
+Above we see that from the 10 million rows there are 14,029 rows that match our predicate.

With the default `collect` method Polars processes all of your data as one batch. This means that all the data has to fit into your available memory at the point of peak memory usage in your query.

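
For context (not part of this diff): a sketch of the difference between the default `collect` and batched execution, using the streaming keyword as it existed in Polars around the time of this commit. The path and column names are hypothetical.

```python
import polars as pl

lazy_query = (
    pl.scan_csv("reddit.csv")
    .filter(pl.col("comment_karma") > 0)
)

# Default: all matching data is materialized as one in-memory batch.
df = lazy_query.collect()

# Streaming mode processes the data in batches to lower peak memory usage.
df_streaming = lazy_query.collect(streaming=True)
```
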
6 changes: 3 additions & 3 deletions docs/user-guide/lazy/query_plan.md
@@ -11,7 +11,7 @@ We can understand both the non-optimized and optimized query plans with visualiz

### Graphviz visualization

-First we visualise the non-optimized plan by setting `optimized=False`.
+First we visualize the non-optimized plan by setting `optimized=False`.

{{code_block('user-guide/lazy/query_plan','plan',['show_graph'])}}

@@ -26,7 +26,7 @@ First we visualise the non-optimized plan by setting `optimized=False`.
--8<-- "python/user-guide/lazy/query_plan.py:createplan"
```

-The query plan visualisation should be read from bottom to top. In the visualisation:
+The query plan visualization should be read from bottom to top. In the visualization:

- each box corresponds to a stage in the query plan
- the `sigma` stands for `SELECTION` and indicates any filter conditions
@@ -55,7 +55,7 @@ The printed plan should also be read from bottom to top. This non-optimized plan

## Optimized query plan

-Now we visualise the optimized plan with `show_graph`.
+Now we visualize the optimized plan with `show_graph`.

{{code_block('user-guide/lazy/query_plan','show',['show_graph'])}}

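
For context (not part of this diff): a sketch of inspecting both plans, assuming a Polars version that provides `explain` and `show_graph` (the latter needs Graphviz installed) with their `optimized` flag. The query itself is made up.

```python
import polars as pl

lazy_query = (
    pl.scan_csv("reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
)

# Non-optimized plan, read from bottom to top.
print(lazy_query.explain(optimized=False))
lazy_query.show_graph(optimized=False)

# Optimized plan: the filter should now be pushed into the CSV scan.
print(lazy_query.explain())
lazy_query.show_graph()
```
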
2 changes: 1 addition & 1 deletion docs/user-guide/lazy/using.md
@@ -10,7 +10,7 @@ Here we see how to use the lazy API starting from either a file or an existing `

## Using the lazy API from a file

-In the ideal case, we use the lazy API right from a file as the query optimizer may help us to reduce the amount of data we read from the file.
+In the ideal case we would use the lazy API right from a file as the query optimizer may help us to reduce the amount of data we read from the file.

We create a lazy query from the Reddit CSV data and apply some transformations.

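
For context (not part of this diff): a sketch of the second starting point the page mentions, turning an existing `DataFrame` lazy with `.lazy()`. The data is made up.

```python
import polars as pl

df = pl.DataFrame({"name": ["a", "b", "c"], "comment_karma": [10, -5, 42]})

# .lazy() converts the DataFrame into a LazyFrame, so the steps below are
# planned (and can be optimized) before .collect() executes them.
result = (
    df.lazy()
    .filter(pl.col("comment_karma") > 0)
    .with_columns(pl.col("name").str.to_uppercase())
    .collect()
)
print(result)
```
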
16 changes: 8 additions & 8 deletions docs/user-guide/migration/pandas.md
@@ -18,7 +18,7 @@ objective. We believe the semantics of a query should not change by the state of
In Polars a DataFrame will always be a 2D table with heterogeneous data-types. The data-types may have nesting, but the
table itself will not.
Operations like resampling will be done by specialized functions or methods that act like 'verbs' on a table explicitly
-stating columns that 'verb' operates on. As such, it is our conviction that not having indices make things simpler,
+stating the columns that that 'verb' operates on. As such, it is our conviction that not having indices make things simpler,
more explicit, more readable and less error-prone.

Note that an 'index' data structure as known in databases will be used by polars as an optimization technique.
@@ -27,7 +27,7 @@ Note that an 'index' data structure as known in databases will be used by polars
### `Polars` uses Apache Arrow arrays to represent data in memory while `Pandas` uses `Numpy` arrays

`Polars` represents data in memory with Arrow arrays while `Pandas` represents data in
-memory in `Numpy` arrays. Apache Arrow is an emerging standard for in-memory columnar
+memory with `Numpy` arrays. Apache Arrow is an emerging standard for in-memory columnar
analytics that can accelerate data load times, reduce memory usage and accelerate
calculations.
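
For context (not part of this diff): a small sketch of the memory-representation difference. `to_arrow` exposes the Arrow-backed data, while `to_numpy` converts to NumPy (and may copy).

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# Polars data is Arrow-backed, so exporting to an Arrow table is cheap.
arrow_table = df.to_arrow()

# NumPy is not the native representation; this conversion may allocate a copy.
numpy_array = df["a"].to_numpy()
```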

@@ -37,21 +37,21 @@ calculations.

`Polars` exploits the strong support for concurrency in Rust to run many operations in
parallel. While some operations in `Pandas` are multi-threaded the core of the library
-is single-threaded and an additional library such as `Dask` must be used to parallelise
+is single-threaded and an additional library such as `Dask` must be used to parallelize
operations.

### `Polars` can lazily evaluate queries and apply query optimization

-Eager evaluation is where code is evaluated as soon as you run the code. Lazy evaluation
-is where running a line of code means that the underlying logic is added to a query plan
+Eager evaluation is when code is evaluated as soon as you run the code. Lazy evaluation
+is when running a line of code means that the underlying logic is added to a query plan
rather than being evaluated.

`Polars` supports eager evaluation and lazy evaluation whereas `Pandas` only supports
eager evaluation. The lazy evaluation mode is powerful because `Polars` carries out
-automatic query optimization where it examines the query plan and looks for ways to
+automatic query optimization when it examines the query plan and looks for ways to
accelerate the query or reduce memory usage.

-`Dask` also supports lazy evaluation where it generates a query plan. However, `Dask`
+`Dask` also supports lazy evaluation when it generates a query plan. However, `Dask`
does not carry out query optimization on the query plan.

## Key syntax differences
@@ -65,7 +65,7 @@ polars != pandas
If your `Polars` code looks like it could be `Pandas` code, it might run, but it likely
runs slower than it should.

-Let's go through some typical `Pandas` code and see how we might write that in `Polars`.
+Let's go through some typical `Pandas` code and see how we might rewrite it in `Polars`.

### Selecting data

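
For context (not part of this diff): a hedged side-by-side sketch of selecting data in both libraries, with made-up column names.

```python
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
pl_df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Pandas: boolean-mask indexing against the (implicit) index.
pd_out = pd_df[pd_df["a"] > 1][["b"]]

# Polars: no index; selection is spelled out with expressions.
pl_out = pl_df.filter(pl.col("a") > 1).select("b")
```
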
22 changes: 11 additions & 11 deletions docs/user-guide/misc/alternatives.md
@@ -6,13 +6,13 @@ These are some tools that share similar functionality to what polars does.

A very versatile tool for small data. Read [10 things I hate about pandas](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)
written by the author himself. Polars has solved all those 10 things.
-Polars is a versatile tool for small and large data with a more predictable API, less ambiguous and stricter API.
+Polars is a versatile tool for small and large data with a more predictable, less ambiguous, and stricter API.

- Pandas the API

The API of pandas was designed for in memory data. This makes it a poor fit for performant analysis on large data
(read anything that does not fit into RAM). Any tool that tries to distribute that API will likely have a
-suboptimal query plan compared to plans that follow from a declarative API like SQL or polars' API.
+suboptimal query plan compared to plans that follow from a declarative API like SQL or Polars' API.

- Dask

@@ -40,27 +40,27 @@ These are some tools that share similar functionality to what polars does.
- DuckDB

Polars and DuckDB have many similarities. DuckDB is focused on providing an in-process OLAP Sqlite alternative,
-polars is focused on providing a scalable `DataFrame` interface to many languages. Those different front-ends lead to
-different optimization strategies and different algorithm prioritization. The interop between both is zero-copy.
+Polars is focused on providing a scalable `DataFrame` interface to many languages. Those different front-ends lead to
+different optimization strategies and different algorithm prioritization. The interoperability between both is zero-copy.
See more: https://duckdb.org/docs/guides/python/polars

- Spark

Spark is designed for distributed workloads and uses the JVM. The setup for spark is complicated and the startup-time
is slow. On a single machine Polars has much better performance characteristics. If you need to process TB's of data
-spark is a better choice.
+Spark is a better choice.

- CuDF

GPU's and CuDF are fast!
-However, GPU's are not readily available and expensive in production. The amount of memory available on GPU often
-is a fraction of available RAM.
-This (and out-of-core) processing means that polars can handle much larger data-sets.
+However, GPU's are not readily available and expensive in production. The amount of memory available on a GPU
+is often a fraction of the available RAM.
+This (and out-of-core) processing means that Polars can handle much larger data-sets.
Next to that Polars can be close in [performance to CuDF](https://zakopilo.hatenablog.jp/entry/2023/02/04/220552).
-CuDF doesn't optimize your query, so is not uncommon that on ETL jobs polars will be faster because it can elide
-unneeded work and materialization's.
+CuDF doesn't optimize your query, so is not uncommon that on ETL jobs Polars will be faster because it can elide
+unneeded work and materializations.

- Any

Polars is written in Rust. This gives it strong safety, performance and concurrency guarantees.
-Polars is written in a modular manner. Parts of polars can be used in other query program and can be added as a library.
+Polars is written in a modular manner. Parts of polars can be used in other query programs and can be added as a library.
10 changes: 5 additions & 5 deletions docs/user-guide/misc/multiprocessing.md
@@ -17,21 +17,21 @@ See [the optimizations section](../lazy/optimizations.md) for more optimizations
## When to use multiprocessing

Although Polars is multithreaded, other libraries may be single-threaded.
-When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to speed up.
+When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to gain a speed up.

## The problem with the default multiprocessing config

### Summary

-The [Python multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html) lists the three methods a process pool can be created:
+The [Python multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html) lists the three methods to create a process pool:

1. spawn
1. fork
1. forkserver

The description of fork is (as of 2022-10-15):

> The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
> The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
> Available on Unix only. The default on Unix.
@@ -52,7 +52,7 @@ Consider the example below, which is a slightly modified example posted on the [
{{code_block('user-guide/misc/multiprocess','example1',[])}}

Using `fork` as the method, instead of `spawn`, will cause a dead lock.
-Please note: Polars will not even start and raise the error on multiprocessing method being set wrong, but if the check would not be there, the deadlock would exist.
+Please note: Polars will not even start and raise the error on multiprocessing method being set wrong, but if the check had not been there, the deadlock would exist.

The fork method is equivalent to calling `os.fork()`, which is a system call as defined in [the POSIX standard](https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html):

@@ -91,7 +91,7 @@ And more importantly, it actually works in combination with multithreaded librar
Fourth, `spawn` starts a new process, and therefore it requires code to be importable, in contrast to `fork`.
In particular, this means that when using `spawn` the relevant code should not be in the global scope, such as in Jupyter notebooks or in plain scripts.
Hence in the examples above, we define functions where we spawn within, and run those functions from a `__main__` clause.
-This is not an issue for typical projects, but in quick experimentation in notebooks it could fail.
+This is not an issue for typical projects, but during quick experimentation in notebooks it could fail.
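
For context (not part of this diff): a minimal sketch of requesting the `spawn` start method explicitly, using only the standard-library `multiprocessing` API, with the work wrapped in functions and started from a `__main__` clause as the page recommends. The data and worker logic are made up.

```python
import multiprocessing

import polars as pl


def work(part: int) -> int:
    # Each spawned worker re-imports this module, so Polars state is fresh here.
    df = pl.DataFrame({"x": list(range(part * 10))})
    return int(df["x"].sum())


def main() -> None:
    # Ask for "spawn" explicitly instead of relying on the platform default.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(work, [1, 2, 3]))


if __name__ == "__main__":
    main()
```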

## References

