This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

clarity and couple typo fixes (#356)
Co-authored-by: chielP <[email protected]>
zm711 and c-peters authored Jul 5, 2023
1 parent 0ccaa84 commit 821dc0c
Showing 10 changed files with 34 additions and 34 deletions.
4 changes: 2 additions & 2 deletions docs/user-guide/expressions/user-defined-functions.md
@@ -1,7 +1,7 @@
# User Defined functions

-You should be convinced by now that polar expressions are so powerful and flexible that the need for custom python functions
-is much less needed than you might need in other libraries.
+You should be convinced by now that polar expressions are so powerful and flexible that there is much less need for custom python functions
+than in other libraries.

Still, you need to have the power to be able to pass an expression's state to a third party library or apply your black box function
over data in polars.
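
For context (not part of this diff): a minimal sketch of passing a column to a third-party "black box" function, assuming a Polars version that provides `Expr.map_batches` (releases contemporary with this commit expose the same hook as `Expr.map`); the column name and the NumPy function are illustrative only.

```python
import numpy as np
import polars as pl

df = pl.DataFrame({"values": [1.0, 2.0, 3.0]})

# Hand the whole Series to an external (NumPy) function in a single call.
out = df.select(
    pl.col("values")
    .map_batches(lambda s: np.log1p(s.to_numpy()))
    .alias("log1p_values")
)
print(out)
```
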
2 changes: 1 addition & 1 deletion docs/user-guide/expressions/window.md
@@ -1,7 +1,7 @@
# Window functions

Window functions are expressions with superpowers. They allow you to perform aggregations on groups in the
-`select` context. Let's get a feel of what that means. First we create a dataset. The dataset loaded in the
+`select` context. Let's get a feel for what that means. First we create a dataset. The dataset loaded in the
snippet below contains information about pokemon:

{{code_block('user-guide/expressions/window','pokemon',['read_csv'])}}
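
For context (not part of this diff): a small self-contained sketch of a window function with `.over()`; the pokemon-style column names are made up to mimic the dataset in the guide.

```python
import polars as pl

df = pl.DataFrame(
    {
        "Name": ["Bulbasaur", "Ivysaur", "Charmander", "Charmeleon"],
        "Type 1": ["Grass", "Grass", "Fire", "Fire"],
        "Speed": [45, 60, 65, 80],
    }
)

# A window function: aggregate per group, but keep one value per original row.
out = df.select(
    pl.col("Name"),
    pl.col("Type 1"),
    pl.col("Speed"),
    pl.col("Speed").mean().over("Type 1").alias("mean_speed_in_type"),
)
print(out)
```
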
2 changes: 1 addition & 1 deletion docs/user-guide/io/csv.md
@@ -19,5 +19,5 @@ file and instead returns a lazy computation holder called a `LazyFrame`.

{{code_block('user-guide/io/csv','scan',['scan_csv'])}}

-If you want to know why this is desirable, you can read more about those `Polars`
+If you want to know why this is desirable, you can read more about these `Polars`
optimizations [here](../concepts/lazy-vs-eager.md).
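
For context (not part of this diff): a hedged sketch of the eager/lazy distinction for CSV reading; the file path and column names are hypothetical.

```python
import polars as pl

# Eager: the whole file is parsed into a DataFrame immediately.
df = pl.read_csv("path.csv")

# Lazy: scan_csv returns a LazyFrame; nothing is read until .collect(),
# so the filter and projection below can be pushed into the scan.
lazy_df = (
    pl.scan_csv("path.csv")
    .filter(pl.col("id") > 10)
    .select(["id", "value"])
)
df_filtered = lazy_df.collect()
```
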
2 changes: 1 addition & 1 deletion docs/user-guide/io/multiple.md
@@ -2,7 +2,7 @@

`Polars` can deal with multiple files differently depending on your needs and memory strain.

-Let's create some files to give use some context:
+Let's create some files to give us some context:

{{code_block('user-guide/io/multiple','create',['write_csv'])}}

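
For context (not part of this diff): a sketch of the setup the page describes, assuming a Polars version where `read_csv`/`scan_csv` accept glob patterns. The file names are illustrative.

```python
import polars as pl

# Create a few small CSV files to work with.
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})
for i in range(3):
    df.write_csv(f"my_many_files_{i}.csv")

# Read them all back into a single DataFrame with a glob pattern.
all_files = pl.read_csv("my_many_files_*.csv")

# Or keep it lazy, letting the optimizer plan how each file is scanned.
all_files_lazy = pl.scan_csv("my_many_files_*.csv").collect()
```
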
2 changes: 1 addition & 1 deletion docs/user-guide/lazy/execution.md
@@ -34,7 +34,7 @@ shape: (14_029, 6)
└─────────┴───────────────────────────┴─────────────┴────────────┴───────────────┴────────────┘
```

-Above we see that from the 10 Million rows there 14,029 rows match our predicate.
+Above we see that from the 10 million rows there are 14,029 rows that match our predicate.

With the default `collect` method Polars processes all of your data as one batch. This means that all the data has to fit into your available memory at the point of peak memory usage in your query.

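
For context (not part of this diff): a sketch of the difference between the default `collect` and batched execution, using the streaming keyword as it existed in Polars around the time of this commit. The path and column names are hypothetical.

```python
import polars as pl

lazy_query = (
    pl.scan_csv("reddit.csv")
    .filter(pl.col("comment_karma") > 0)
)

# Default: all matching data is materialized as one in-memory batch.
df = lazy_query.collect()

# Streaming mode processes the data in batches to lower peak memory usage.
df_streaming = lazy_query.collect(streaming=True)
```
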
6 changes: 3 additions & 3 deletions docs/user-guide/lazy/query_plan.md
@@ -11,7 +11,7 @@ We can understand both the non-optimized and optimized query plans with visualiz

### Graphviz visualization

-First we visualise the non-optimized plan by setting `optimized=False`.
+First we visualize the non-optimized plan by setting `optimized=False`.

{{code_block('user-guide/lazy/query_plan','plan',['show_graph'])}}

@@ -26,7 +26,7 @@ First we visualise the non-optimized plan by setting `optimized=False`.
--8<-- "python/user-guide/lazy/query_plan.py:createplan"
```

-The query plan visualisation should be read from bottom to top. In the visualisation:
+The query plan visualization should be read from bottom to top. In the visualization:

- each box corresponds to a stage in the query plan
- the `sigma` stands for `SELECTION` and indicates any filter conditions
@@ -55,7 +55,7 @@ The printed plan should also be read from bottom to top. This non-optimized plan

## Optimized query plan

-Now we visualise the optimized plan with `show_graph`.
+Now we visualize the optimized plan with `show_graph`.

{{code_block('user-guide/lazy/query_plan','show',['show_graph'])}}

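
For context (not part of this diff): a sketch of inspecting both plans, assuming a Polars version that provides `explain` and `show_graph` (the latter needs Graphviz installed) with their `optimized` flag. The query itself is made up.

```python
import polars as pl

lazy_query = (
    pl.scan_csv("reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
)

# Non-optimized plan, read from bottom to top.
print(lazy_query.explain(optimized=False))
lazy_query.show_graph(optimized=False)

# Optimized plan: the filter should now be pushed into the CSV scan.
print(lazy_query.explain())
lazy_query.show_graph()
```
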
2 changes: 1 addition & 1 deletion docs/user-guide/lazy/using.md
@@ -10,7 +10,7 @@ Here we see how to use the lazy API starting from either a file or an existing `

## Using the lazy API from a file

-In the ideal case, we use the lazy API right from a file as the query optimizer may help us to reduce the amount of data we read from the file.
+In the ideal case we would use the lazy API right from a file as the query optimizer may help us to reduce the amount of data we read from the file.

We create a lazy query from the Reddit CSV data and apply some transformations.

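
For context (not part of this diff): a sketch of the second starting point the page mentions, turning an existing `DataFrame` lazy with `.lazy()`. The data is made up.

```python
import polars as pl

df = pl.DataFrame({"name": ["a", "b", "c"], "comment_karma": [10, -5, 42]})

# .lazy() converts the DataFrame into a LazyFrame, so the steps below are
# planned (and can be optimized) before .collect() executes them.
result = (
    df.lazy()
    .filter(pl.col("comment_karma") > 0)
    .with_columns(pl.col("name").str.to_uppercase())
    .collect()
)
print(result)
```
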
16 changes: 8 additions & 8 deletions docs/user-guide/migration/pandas.md
@@ -18,7 +18,7 @@ objective. We believe the semantics of a query should not change by the state of
In Polars a DataFrame will always be a 2D table with heterogeneous data-types. The data-types may have nesting, but the
table itself will not.
Operations like resampling will be done by specialized functions or methods that act like 'verbs' on a table explicitly
-stating columns that 'verb' operates on. As such, it is our conviction that not having indices make things simpler,
+stating the columns that that 'verb' operates on. As such, it is our conviction that not having indices make things simpler,
more explicit, more readable and less error-prone.

Note that an 'index' data structure as known in databases will be used by polars as an optimization technique.
@@ -27,7 +27,7 @@ Note that an 'index' data structure as known in databases will be used by polars
### `Polars` uses Apache Arrow arrays to represent data in memory while `Pandas` uses `Numpy` arrays

`Polars` represents data in memory with Arrow arrays while `Pandas` represents data in
-memory in `Numpy` arrays. Apache Arrow is an emerging standard for in-memory columnar
+memory with `Numpy` arrays. Apache Arrow is an emerging standard for in-memory columnar
analytics that can accelerate data load times, reduce memory usage and accelerate
calculations.
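
For context (not part of this diff): a small sketch of the memory-representation difference. `to_arrow` exposes the Arrow-backed data, while `to_numpy` converts to NumPy (and may copy).

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# Polars data is Arrow-backed, so exporting to an Arrow table is cheap.
arrow_table = df.to_arrow()

# NumPy is not the native representation; this conversion may allocate a copy.
numpy_array = df["a"].to_numpy()
```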

@@ -37,21 +37,21 @@ calculations.

`Polars` exploits the strong support for concurrency in Rust to run many operations in
parallel. While some operations in `Pandas` are multi-threaded the core of the library
-is single-threaded and an additional library such as `Dask` must be used to parallelise
+is single-threaded and an additional library such as `Dask` must be used to parallelize
operations.

### `Polars` can lazily evaluate queries and apply query optimization

-Eager evaluation is where code is evaluated as soon as you run the code. Lazy evaluation
-is where running a line of code means that the underlying logic is added to a query plan
+Eager evaluation is when code is evaluated as soon as you run the code. Lazy evaluation
+is when running a line of code means that the underlying logic is added to a query plan
rather than being evaluated.

`Polars` supports eager evaluation and lazy evaluation whereas `Pandas` only supports
eager evaluation. The lazy evaluation mode is powerful because `Polars` carries out
-automatic query optimization where it examines the query plan and looks for ways to
+automatic query optimization when it examines the query plan and looks for ways to
accelerate the query or reduce memory usage.

-`Dask` also supports lazy evaluation where it generates a query plan. However, `Dask`
+`Dask` also supports lazy evaluation when it generates a query plan. However, `Dask`
does not carry out query optimization on the query plan.

## Key syntax differences
@@ -65,7 +65,7 @@ polars != pandas
If your `Polars` code looks like it could be `Pandas` code, it might run, but it likely
runs slower than it should.

-Let's go through some typical `Pandas` code and see how we might write that in `Polars`.
+Let's go through some typical `Pandas` code and see how we might rewrite it in `Polars`.

### Selecting data

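
For context (not part of this diff): a hedged side-by-side sketch of selecting data in both libraries, with made-up column names.

```python
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
pl_df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Pandas: boolean-mask indexing against the (implicit) index.
pd_out = pd_df[pd_df["a"] > 1][["b"]]

# Polars: no index; selection is spelled out with expressions.
pl_out = pl_df.filter(pl.col("a") > 1).select("b")
```
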
22 changes: 11 additions & 11 deletions docs/user-guide/misc/alternatives.md
@@ -6,13 +6,13 @@ These are some tools that share similar functionality to what polars does.

A very versatile tool for small data. Read [10 things I hate about pandas](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)
written by the author himself. Polars has solved all those 10 things.
-Polars is a versatile tool for small and large data with a more predictable API, less ambiguous and stricter API.
+Polars is a versatile tool for small and large data with a more predictable, less ambiguous, and stricter API.

- Pandas the API

The API of pandas was designed for in memory data. This makes it a poor fit for performant analysis on large data
(read anything that does not fit into RAM). Any tool that tries to distribute that API will likely have a
-suboptimal query plan compared to plans that follow from a declarative API like SQL or polars' API.
+suboptimal query plan compared to plans that follow from a declarative API like SQL or Polars' API.

- Dask

@@ -40,27 +40,27 @@ These are some tools that share similar functionality to what polars does.
- DuckDB

Polars and DuckDB have many similarities. DuckDB is focused on providing an in-process OLAP Sqlite alternative,
-polars is focused on providing a scalable `DataFrame` interface to many languages. Those different front-ends lead to
-different optimization strategies and different algorithm prioritization. The interop between both is zero-copy.
+Polars is focused on providing a scalable `DataFrame` interface to many languages. Those different front-ends lead to
+different optimization strategies and different algorithm prioritization. The interoperability between both is zero-copy.
See more: https://duckdb.org/docs/guides/python/polars

- Spark

Spark is designed for distributed workloads and uses the JVM. The setup for spark is complicated and the startup-time
is slow. On a single machine Polars has much better performance characteristics. If you need to process TB's of data
-spark is a better choice.
+Spark is a better choice.

- CuDF

GPU's and CuDF are fast!
-However, GPU's are not readily available and expensive in production. The amount of memory available on GPU often
-is a fraction of available RAM.
-This (and out-of-core) processing means that polars can handle much larger data-sets.
+However, GPU's are not readily available and expensive in production. The amount of memory available on a GPU
+is often a fraction of the available RAM.
+This (and out-of-core) processing means that Polars can handle much larger data-sets.
Next to that Polars can be close in [performance to CuDF](https://zakopilo.hatenablog.jp/entry/2023/02/04/220552).
-CuDF doesn't optimize your query, so is not uncommon that on ETL jobs polars will be faster because it can elide
-unneeded work and materialization's.
+CuDF doesn't optimize your query, so is not uncommon that on ETL jobs Polars will be faster because it can elide
+unneeded work and materializations.

- Any

Polars is written in Rust. This gives it strong safety, performance and concurrency guarantees.
-Polars is written in a modular manner. Parts of polars can be used in other query program and can be added as a library.
+Polars is written in a modular manner. Parts of polars can be used in other query programs and can be added as a library.
10 changes: 5 additions & 5 deletions docs/user-guide/misc/multiprocessing.md
@@ -17,21 +17,21 @@ See [the optimizations section](../lazy/optimizations.md) for more optimizations
## When to use multiprocessing

Although Polars is multithreaded, other libraries may be single-threaded.
-When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to speed up.
+When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to gain a speed up.

## The problem with the default multiprocessing config

### Summary

-The [Python multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html) lists the three methods a process pool can be created:
+The [Python multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html) lists the three methods to create a process pool:

1. spawn
1. fork
1. forkserver

The description of fork is (as of 2022-10-15):

> The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
> The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
> Available on Unix only. The default on Unix.
@@ -52,7 +52,7 @@ Consider the example below, which is a slightly modified example posted on the [
{{code_block('user-guide/misc/multiprocess','example1',[])}}

Using `fork` as the method, instead of `spawn`, will cause a dead lock.
-Please note: Polars will not even start and raise the error on multiprocessing method being set wrong, but if the check would not be there, the deadlock would exist.
+Please note: Polars will not even start and raise the error on multiprocessing method being set wrong, but if the check had not been there, the deadlock would exist.

The fork method is equivalent to calling `os.fork()`, which is a system call as defined in [the POSIX standard](https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html):

@@ -91,7 +91,7 @@ And more importantly, it actually works in combination with multithreaded librar
Fourth, `spawn` starts a new process, and therefore it requires code to be importable, in contrast to `fork`.
In particular, this means that when using `spawn` the relevant code should not be in the global scope, such as in Jupyter notebooks or in plain scripts.
Hence in the examples above, we define functions where we spawn within, and run those functions from a `__main__` clause.
-This is not an issue for typical projects, but in quick experimentation in notebooks it could fail.
+This is not an issue for typical projects, but during quick experimentation in notebooks it could fail.
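
For context (not part of this diff): a minimal sketch of requesting the `spawn` start method explicitly, using only the standard-library `multiprocessing` API, with the work wrapped in functions and started from a `__main__` clause as the page recommends. The data and worker logic are made up.

```python
import multiprocessing

import polars as pl


def work(part: int) -> int:
    # Each spawned worker re-imports this module, so Polars state is fresh here.
    df = pl.DataFrame({"x": list(range(part * 10))})
    return int(df["x"].sum())


def main() -> None:
    # Ask for "spawn" explicitly instead of relying on the platform default.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(work, [1, 2, 3]))


if __name__ == "__main__":
    main()
```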

## References

