GH-41611: [Docs][CI] Enable most sphinx-lint rules for documentation (apache#41612)

### Rationale for this change

apache#41611

### What changes are included in this PR?

- Update the pre-commit config to enable all sphinx-lint checks except `dangling-hyphen` and `line-too-long` by default
- Associated documentation fixes

### Are these changes tested?

Yes, by building and looking at the docs locally.

### Are there any user-facing changes?

Just docs.
* GitHub Issue: apache#41611

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
amoeba authored and vibhatha committed May 25, 2024
1 parent 27e5320 commit 2e9b06b
Showing 31 changed files with 98 additions and 92 deletions.
10 changes: 8 additions & 2 deletions .pre-commit-config.yaml
@@ -136,5 +136,11 @@ repos:
rev: v0.9.1
hooks:
- id: sphinx-lint
files: ^docs/
args: ['--disable', 'all', '--enable', 'trailing-whitespace,missing-final-newline', 'docs']
files: ^docs/source
exclude: ^docs/source/python/generated
args: [
'--enable',
'all',
'--disable',
'dangling-hyphen,line-too-long',
]
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -535,7 +535,7 @@
#
# latex_appendices = []

# It false, will not define \strong, \code, itleref, \crossref ... but only
# If false, will not define \strong, \code, \titleref, \crossref ... but only
# \sphinxstrong, ..., \sphinxtitleref, ... To help avoid clash with user added
# packages.
#
10 changes: 5 additions & 5 deletions docs/source/cpp/acero/developer_guide.rst
@@ -327,8 +327,8 @@ An engine could choose to create a thread task for every execution of a node. H
this leads to problems with cache locality. For example, let's assume we have a basic plan consisting of three
exec nodes, scan, project, and then filter (this is a very common use case). Now let's assume there are 100 batches.
In a task-per-operator model we would have tasks like "Scan Batch 5", "Project Batch 5", and "Filter Batch 5". Each
of those tasks is potentially going to access the same data. For example, maybe the `project` and `filter` nodes need
to read the same column. A column which is intially created in a decode phase of the `scan` node. To maximize cache
of those tasks is potentially going to access the same data. For example, maybe the ``project`` and ``filter`` nodes need
to read the same column, a column which is initially created in the decode phase of the ``scan`` node. To maximize cache
utilization we would need to carefully schedule our tasks to ensure that all three of those tasks are run consecutively
and assigned to the same CPU core.
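
For illustration, a rough sketch of the scan/project/filter plan shape described above, written against Acero's `Declaration` API; a table source stands in for the dataset scan, the table and column names are placeholders, and exact option signatures may vary by Arrow version:

```cpp
#include <arrow/api.h>
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/compute/expression.h>

namespace acero = arrow::acero;
namespace cp = arrow::compute;

// Sketch only: a three-node plan (source -> project -> filter).
// Acero decides how the per-batch tasks for these nodes are scheduled.
arrow::Result<std::shared_ptr<arrow::Table>> RunThreeNodePlan(
    std::shared_ptr<arrow::Table> input) {
  acero::Declaration plan = acero::Declaration::Sequence({
      {"table_source", acero::TableSourceNodeOptions(std::move(input))},
      {"project", acero::ProjectNodeOptions(
                      {cp::call("multiply", {cp::field_ref("x"), cp::literal(2)})},
                      {"x_times_2"})},
      {"filter", acero::FilterNodeOptions(
                     cp::greater(cp::field_ref("x_times_2"), cp::literal(10)))},
  });
  return acero::DeclarationToTable(std::move(plan));
}
```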

@@ -412,7 +412,7 @@ Ordered Execution
=================

Some nodes either establish an ordering to their outgoing batches or they need to be able to process batches in order.
Acero handles ordering using the `batch_index` property on an ExecBatch. If a node has a deterministic output order
Acero handles ordering using the ``batch_index`` property on an ExecBatch. If a node has a deterministic output order
then it should apply a batch index on batches that it emits. For example, the OrderByNode applies a new ordering to
batches (regardless of the incoming ordering). The scan node is able to attach an implicit ordering to batches which
reflects the order of the rows in the files being scanned.
@@ -461,8 +461,8 @@ Acero's tracing is currently half-implemented and there are major gaps in profil
effort at tracing with open telemetry and most of the necessary pieces are in place. The main thing currently lacking is
some kind of effective visualization of the tracing results.

In order to use the tracing that is present today you will need to build with Arrow with `ARROW_WITH_OPENTELEMETRY=ON`.
Then you will need to set the environment variable `ARROW_TRACING_BACKEND=otlp_http`. This will configure open telemetry
In order to use the tracing that is present today you will need to build Arrow with ``ARROW_WITH_OPENTELEMETRY=ON``.
Then you will need to set the environment variable ``ARROW_TRACING_BACKEND=otlp_http``. This will configure open telemetry
to export trace results (as OTLP) to the HTTP endpoint http://localhost:4318/v1/traces. You will need to configure an
open telemetry collector to collect results on that endpoint and you will need to configure a trace viewer of some kind
such as Jaeger: https://www.jaegertracing.io/docs/1.21/opentelemetry/
26 changes: 13 additions & 13 deletions docs/source/cpp/acero/overview.rst
@@ -209,16 +209,16 @@ must have the same length. There are a few key differences from ExecBatch:

Both the record batch and the exec batch have strong ownership of the arrays & buffers

* An `ExecBatch` does not have a schema. This is because an `ExecBatch` is assumed to be
* An ``ExecBatch`` does not have a schema. This is because an ``ExecBatch`` is assumed to be
part of a stream of batches and the stream is assumed to have a consistent schema. So
the schema for an `ExecBatch` is typically stored in the ExecNode.
* Columns in an `ExecBatch` are either an `Array` or a `Scalar`. When a column is a `Scalar`
this means that the column has a single value for every row in the batch. An `ExecBatch`
the schema for an ``ExecBatch`` is typically stored in the ExecNode.
* Columns in an ``ExecBatch`` are either an ``Array`` or a ``Scalar``. When a column is a ``Scalar``
this means that the column has a single value for every row in the batch. An ``ExecBatch``
also has a length property which describes how many rows are in a batch. So another way to
view a `Scalar` is a constant array with `length` elements.
* An `ExecBatch` contains additional information used by the exec plan. For example, an
`index` can be used to describe a batch's position in an ordered stream. We expect
that `ExecBatch` will also evolve to contain additional fields such as a selection vector.
view a ``Scalar`` is as a constant array with ``length`` elements.
* An ``ExecBatch`` contains additional information used by the exec plan. For example, an
``index`` can be used to describe a batch's position in an ordered stream. We expect
that ``ExecBatch`` will also evolve to contain additional fields such as a selection vector.
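
As a minimal sketch of the points above (with placeholder values), an `ExecBatch` can hold one array column and one scalar column, where the scalar logically repeats its single value for every row of the batch:

```cpp
#include <arrow/api.h>
#include <arrow/compute/exec.h>

// Sketch: an ExecBatch with one array column and one scalar column.
arrow::Status ExecBatchSketch() {
  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3, 4}));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> values, builder.Finish());
  std::shared_ptr<arrow::Scalar> constant = arrow::MakeScalar(int64_t{7});

  // No schema is attached: the stream's schema lives in the ExecNode.
  arrow::compute::ExecBatch batch({values, constant}, /*length=*/4);
  // Recent versions also carry an index describing the batch's position
  // in an ordered stream.
  return arrow::Status::OK();
}
```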

.. figure:: scalar_vs_array.svg

@@ -231,8 +231,8 @@ only zero copy if there are no scalars in the exec batch.

.. note::
Both Acero and the compute module have "lightweight" versions of batches and arrays.
In the compute module these are called `BatchSpan`, `ArraySpan`, and `BufferSpan`. In
Acero the concept is called `KeyColumnArray`. These types were developed concurrently
In the compute module these are called ``BatchSpan``, ``ArraySpan``, and ``BufferSpan``. In
Acero the concept is called ``KeyColumnArray``. These types were developed concurrently
and serve the same purpose. They aim to provide an array container that can be completely
stack allocated (provided the data type is non-nested) in order to avoid heap allocation
overhead. Ideally these two concepts will be merged someday.
@@ -247,9 +247,9 @@ execution of the nodes. Both ExecPlan and ExecNode are tied to the lifecycle of
They have state and are not expected to be restartable.

.. warning::
The structures within Acero, including `ExecBatch`, are still experimental. The `ExecBatch`
class should not be used outside of Acero. Instead, an `ExecBatch` should be converted to
a more standard structure such as a `RecordBatch`.
The structures within Acero, including ``ExecBatch``, are still experimental. The ``ExecBatch``
class should not be used outside of Acero. Instead, an ``ExecBatch`` should be converted to
a more standard structure such as a ``RecordBatch``.
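
For example, a rough sketch of handing results to non-Acero code by converting to a `RecordBatch`, assuming a schema matching the batch's columns is available:

```cpp
#include <arrow/api.h>
#include <arrow/compute/exec.h>

// Sketch: convert an ExecBatch to a RecordBatch before passing it to code
// outside of Acero. Assumes `schema` matches the columns of the batch.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> ToStandardBatch(
    const arrow::compute::ExecBatch& batch, std::shared_ptr<arrow::Schema> schema) {
  return batch.ToRecordBatch(std::move(schema));
}
```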

Similarly, an ExecPlan is an internal concept. Users creating plans should be using Declaration
objects. APIs for consuming and executing plans should abstract away the details of the underlying
8 changes: 4 additions & 4 deletions docs/source/cpp/acero/user_guide.rst
@@ -455,8 +455,8 @@ can be selected from :ref:`this list of aggregation functions
will be added which should alleviate this constraint.

The aggregation can provide results as a group or scalar. For instance,
an operation like `hash_count` provides the counts per each unique record
as a grouped result while an operation like `sum` provides a single record.
an operation like ``hash_count`` provides the counts for each unique record
as a grouped result while an operation like ``sum`` provides a single record.
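
For illustration (separate from the guide's own examples below), a rough sketch contrasting a grouped `hash_count` with a scalar `sum` using the `Declaration` API; the table and the column names "key" and "x" are placeholders:

```cpp
#include <arrow/api.h>
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/compute/api_aggregate.h>

namespace acero = arrow::acero;

// Sketch only: grouped vs. scalar aggregation of a placeholder table.
arrow::Status AggregationSketch(std::shared_ptr<arrow::Table> table) {
  // Grouped: one count per distinct value of "key".
  acero::Declaration grouped = acero::Declaration::Sequence({
      {"table_source", acero::TableSourceNodeOptions(table)},
      {"aggregate", acero::AggregateNodeOptions(
                        /*aggregates=*/{{"hash_count", nullptr, "x", "count(x)"}},
                        /*keys=*/{"key"})},
  });
  ARROW_ASSIGN_OR_RAISE(auto per_key, acero::DeclarationToTable(std::move(grouped)));

  // Scalar: a single record for the whole input.
  acero::Declaration scalar = acero::Declaration::Sequence({
      {"table_source", acero::TableSourceNodeOptions(table)},
      {"aggregate", acero::AggregateNodeOptions(
                        /*aggregates=*/{{"sum", nullptr, "x", "sum(x)"}})},
  });
  ARROW_ASSIGN_OR_RAISE(auto total, acero::DeclarationToTable(std::move(scalar)));
  return arrow::Status::OK();
}
```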

Scalar Aggregation example:

@@ -490,7 +490,7 @@ caller will repeatedly call this function until the generator function is exhaus
will accumulate in memory. An execution plan should only have one
"terminal" node (one sink node). An :class:`ExecPlan` can terminate early due to cancellation or
an error, before the output is fully consumed. However, the plan can be safely destroyed independently
of the sink, which will hold the unconsumed batches by `exec_plan->finished()`.
of the sink, which will hold the unconsumed batches by ``exec_plan->finished()``.
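
Rather than driving the sink generator by hand, results can also be pulled as a stream; a rough sketch using the `DeclarationToReader` helper, assuming that helper is available in your Arrow version and the plan has no sink attached:

```cpp
#include <arrow/api.h>
#include <arrow/acero/exec_plan.h>

// Sketch: consume a plan's output as a stream of record batches.
arrow::Status ConsumeAsStream(arrow::acero::Declaration plan) {
  ARROW_ASSIGN_OR_RAISE(auto reader, arrow::acero::DeclarationToReader(std::move(plan)));
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // stream exhausted
    // ... use `batch` ...
  }
  return arrow::Status::OK();
}
```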

As a part of the Source Example, the Sink operation is also included;

@@ -515,7 +515,7 @@ The consuming function may be called before a previous invocation has completed.
function does not run quickly enough then many concurrent executions could pile up, blocking the
CPU thread pool. The execution plan will not be marked finished until all consuming function callbacks
have been completed.
Once all batches have been delivered the execution plan will wait for the `finish` future to complete
Once all batches have been delivered the execution plan will wait for the ``finish`` future to complete
before marking the execution plan finished. This allows for workflows where the consumption function
converts batches into async tasks (this is currently done internally for the dataset write node).

2 changes: 1 addition & 1 deletion docs/source/cpp/build_system.rst
@@ -167,7 +167,7 @@ file into an executable linked with the Arrow C++ shared library:
.. code-block:: makefile
my_example: my_example.cc
$(CXX) -o $@ $(CXXFLAGS) $< $$(pkg-config --cflags --libs arrow)
$(CXX) -o $@ $(CXXFLAGS) $< $$(pkg-config --cflags --libs arrow)
Many build systems support pkg-config. For example:

18 changes: 9 additions & 9 deletions docs/source/cpp/compute.rst
@@ -514,8 +514,8 @@ Mixed time resolution temporal inputs will be cast to finest input resolution.
+------------+---------------------------------------------+

It's compatible with Redshift's decimal promotion rules. All decimal digits
are preserved for `add`, `subtract` and `multiply` operations. The result
precision of `divide` is at least the sum of precisions of both operands with
are preserved for ``add``, ``subtract`` and ``multiply`` operations. The result
precision of ``divide`` is at least the sum of precisions of both operands with
enough scale kept. Error is returned if the result precision is beyond the
decimal value range.
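
For illustration, a rough sketch of the promotion for `multiply`; the decimal values and precisions are placeholders:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sketch: multiplying two decimal128(5, 2) values. The result type is a wider
// decimal chosen by the promotion rules above, so no digits are lost.
arrow::Status DecimalMultiplySketch() {
  auto ty = arrow::decimal128(/*precision=*/5, /*scale=*/2);
  std::shared_ptr<arrow::Scalar> a =
      std::make_shared<arrow::Decimal128Scalar>(arrow::Decimal128(12345), ty);  // 123.45
  std::shared_ptr<arrow::Scalar> b =
      std::make_shared<arrow::Decimal128Scalar>(arrow::Decimal128(678), ty);    // 6.78
  ARROW_ASSIGN_OR_RAISE(arrow::Datum product,
                        arrow::compute::CallFunction("multiply", {a, b}));
  // product.type() now has enough precision and scale for the exact result.
  return arrow::Status::OK();
}
```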

@@ -1029,7 +1029,7 @@ These functions trim off characters on both sides (trim), or the left (ltrim) or
+--------------------------+------------+-------------------------+---------------------+----------------------------------------+---------+

* \(1) Only characters specified in :member:`TrimOptions::characters` will be
trimmed off. Both the input string and the `characters` argument are
trimmed off. Both the input string and the ``characters`` argument are
interpreted as ASCII characters.

* \(2) Only trim off ASCII whitespace characters (``'\t'``, ``'\n'``, ``'\v'``,
@@ -1570,7 +1570,7 @@ is the same, even though the UTC years would be different.
Timezone handling
~~~~~~~~~~~~~~~~~

`assume_timezone` function is meant to be used when an external system produces
``assume_timezone`` function is meant to be used when an external system produces
"timezone-naive" timestamps which need to be converted to "timezone-aware"
timestamps (see for example the `definition
<https://docs.python.org/3/library/datetime.html#aware-and-naive-objects>`__
@@ -1581,11 +1581,11 @@ Input timestamps are assumed to be relative to the timezone given in
UTC-relative timestamps with the timezone metadata set to the above value.
An error is returned if the timestamps already have the timezone metadata set.

`local_timestamp` function converts UTC-relative timestamps to local "timezone-naive"
``local_timestamp`` function converts UTC-relative timestamps to local "timezone-naive"
timestamps. The timezone is taken from the timezone metadata of the input
timestamps. This function is the inverse of `assume_timezone`. Please note:
timestamps. This function is the inverse of ``assume_timezone``. Please note:
**all temporal functions already operate on timestamps as if they were in local
time of the metadata provided timezone**. Using `local_timestamp` is only meant to be
time of the metadata provided timezone**. Using ``local_timestamp`` is only meant to be
used when an external system expects local timestamps.
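
For illustration, a short C++ sketch of the round trip; the timezone name is a placeholder:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// Sketch: make naive timestamps timezone-aware, then back again.
arrow::Status TimezoneSketch(std::shared_ptr<arrow::Array> naive) {
  cp::AssumeTimezoneOptions options("Europe/Ljubljana");
  ARROW_ASSIGN_OR_RAISE(arrow::Datum aware,
                        cp::CallFunction("assume_timezone", {naive}, &options));

  // Inverse direction: drop the timezone, keep the local wall-clock values.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum local,
                        cp::CallFunction("local_timestamp", {aware}));
  return arrow::Status::OK();
}
```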

+-----------------+-------+-------------+---------------+---------------------------------+-------+
@@ -1649,8 +1649,8 @@ overflow is detected.

* \(1) CumulativeOptions has two optional parameters. The first parameter
:member:`CumulativeOptions::start` is a starting value for the running
accumulation. It has a default value of 0 for `sum`, 1 for `prod`, min of
input type for `max`, and max of input type for `min`. Specified values of
accumulation. It has a default value of 0 for ``sum``, 1 for ``prod``, min of
input type for ``max``, and max of input type for ``min``. Specified values of
``start`` must be castable to the input type. The second parameter
:member:`CumulativeOptions::skip_nulls` is a boolean. When set to
false (the default), the first encountered null is propagated. When set to
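
For illustration, a rough sketch of a running sum with an explicit start value; the input array is a placeholder:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sketch: cumulative sum starting at 100, propagating the first null.
arrow::Status CumulativeSketch(std::shared_ptr<arrow::Array> values) {
  arrow::compute::CumulativeOptions options(/*start=*/100.0, /*skip_nulls=*/false);
  ARROW_ASSIGN_OR_RAISE(arrow::Datum running,
                        arrow::compute::CallFunction("cumulative_sum", {values}, &options));
  return arrow::Status::OK();
}
```
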
2 changes: 1 addition & 1 deletion docs/source/developers/cpp/building.rst
@@ -312,7 +312,7 @@ depends on ``python`` being available).

On some Linux distributions, running the test suite might require setting an
explicit locale. If you see any locale-related errors, try setting the
environment variable (which requires the `locales` package or equivalent):
environment variable (which requires the ``locales`` package or equivalent):

.. code-block::
2 changes: 1 addition & 1 deletion docs/source/developers/documentation.rst
@@ -259,7 +259,7 @@ Build the docs in the target directory:
sphinx-build ./source/developers ./source/developers/_build -c ./source -D master_doc=temp_index
This builds everything in the target directory to a folder inside of it
called ``_build`` using the config file in the `source` directory.
called ``_build`` using the config file in the ``source`` directory.

Once you have verified the HTML documents, you can remove the temporary index file:

4 changes: 2 additions & 2 deletions docs/source/developers/guide/step_by_step/arrow_codebase.rst
@@ -99,8 +99,8 @@ can be called from a function in another language. After a function is defined
C++ we must create the binding manually to use it in that implementation.

.. note::
There is much you can learn by checking **Pull Requests**
and **unit tests** for similar issues.
There is much you can learn by checking **Pull Requests**
and **unit tests** for similar issues.

.. tab-set::

8 changes: 4 additions & 4 deletions docs/source/developers/guide/step_by_step/set_up.rst
@@ -118,10 +118,10 @@ Should give you a result similar to this:

.. code:: console
origin https://github.com/<your username>/arrow.git (fetch)
origin https://github.com/<your username>/arrow.git (push)
upstream https://github.com/apache/arrow (fetch)
upstream https://github.com/apache/arrow (push)
origin https://github.com/<your username>/arrow.git (fetch)
origin https://github.com/<your username>/arrow.git (push)
upstream https://github.com/apache/arrow (fetch)
upstream https://github.com/apache/arrow (push)
If you did everything correctly, you should now have a copy of the code
in the ``arrow`` directory and two remotes that refer to your own GitHub
4 changes: 2 additions & 2 deletions docs/source/developers/release.rst
@@ -106,7 +106,7 @@ If there is consensus and there is a Release Manager willing to take the effort
the release a patch release can be created.

Committers can tag issues that should be included on the next patch release using the
`backport-candidate` label. Is the responsability of the author or the committer to add the
``backport-candidate`` label. It is the responsibility of the author or the committer to add the
label to the issue to help the Release Manager identify the issues that should be backported.

If a specific issue is identified as the reason to create a patch release the Release Manager
@@ -117,7 +117,7 @@ Be sure to go through on the following checklist:
#. Create milestone
#. Create maintenance branch
#. Include issue that was requested as requiring new patch release
#. Add new milestone to issues with `backport-candidate` label
#. Add new milestone to issues with ``backport-candidate`` label
#. cherry-pick issues into maintenance branch

Creating a Release Candidate
4 changes: 2 additions & 2 deletions docs/source/format/CanonicalExtensions.rst
@@ -77,7 +77,7 @@ Official List
Fixed shape tensor
==================

* Extension name: `arrow.fixed_shape_tensor`.
* Extension name: ``arrow.fixed_shape_tensor``.

* The storage type of the extension: ``FixedSizeList`` where:

@@ -153,7 +153,7 @@ Fixed shape tensor
Variable shape tensor
=====================

* Extension name: `arrow.variable_shape_tensor`.
* Extension name: ``arrow.variable_shape_tensor``.

* The storage type of the extension is: ``StructArray`` where struct
is composed of **data** and **shape** fields describing a single
6 changes: 3 additions & 3 deletions docs/source/format/Columnar.rst
@@ -312,7 +312,7 @@ Each value in this layout consists of 0 or more bytes. While primitive
arrays have a single values buffer, variable-size binary have an
**offsets** buffer and **data** buffer.

The offsets buffer contains `length + 1` signed integers (either
The offsets buffer contains ``length + 1`` signed integers (either
32-bit or 64-bit, depending on the logical type), which encode the
start position of each slot in the data buffer. The length of the
value in each slot is computed using the difference between the offset
@@ -374,7 +374,7 @@ locations are indicated using a **views** buffer, which may point to one
of potentially several **data** buffers or may contain the characters
inline.

The views buffer contains `length` view structures with the following layout:
The views buffer contains ``length`` view structures with the following layout:

::

@@ -394,7 +394,7 @@ should be interpreted.

In the short string case the string's bytes are inlined — stored inside the
view itself, in the twelve bytes which follow the length. Any remaining bytes
after the string itself are padded with `0`.
after the string itself are padded with ``0``.

In the long string case, a buffer index indicates which data buffer
stores the data bytes and an offset indicates where in that buffer the
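
For illustration, a small sketch inspecting the offsets buffer described above; the input strings are placeholders:

```cpp
#include <arrow/api.h>
#include <iostream>

// Sketch: for ["joe", "", "mark"] the offsets buffer holds length + 1 entries,
// here 0, 3, 3, 7; each value's length is the difference of adjacent offsets.
arrow::Status OffsetsSketch() {
  arrow::StringBuilder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({"joe", "", "mark"}));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> array, builder.Finish());

  auto strings = std::static_pointer_cast<arrow::StringArray>(array);
  for (int64_t i = 0; i <= strings->length(); ++i) {
    std::cout << strings->value_offset(i) << " ";  // prints: 0 3 3 7
  }
  std::cout << std::endl;
  return arrow::Status::OK();
}
```
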
2 changes: 1 addition & 1 deletion docs/source/format/FlightSql.rst
@@ -193,7 +193,7 @@ in the ``app_metadata`` field of the Flight RPC ``PutResult`` returned.

When used with DoPut: load the stream of Arrow record batches into
the specified target table and return the number of rows ingested
via a `DoPutUpdateResult` message.
via a ``DoPutUpdateResult`` message.

Flight Server Session Management
--------------------------------
2 changes: 1 addition & 1 deletion docs/source/format/Integration.rst
@@ -501,7 +501,7 @@ integration testing actually tests.

There are two types of integration test cases: the ones populated on the fly
by the data generator in the Archery utility, and *gold* files that exist
in the `arrow-testing <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration>`
in the `arrow-testing <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration>`_
repository.

Data Generator Tests
2 changes: 1 addition & 1 deletion docs/source/java/algorithm.rst
@@ -82,7 +82,7 @@ for fixed width and variable width vectors, respectively. Both algorithms run in

3. **Index sorter**: this sorter does not actually sort the vector. Instead, it returns an integer
vector, which correspond to indices of vector elements in sorted order. With the index vector, one can
easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the ``k``th
easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the ``k`` th
smallest value in the vector. Index sorting is supported by ``org.apache.arrow.algorithm.sort.IndexSorter``,
which runs in ``O(nlog(n))`` time. It is applicable to vectors of any type.

2 changes: 1 addition & 1 deletion docs/source/java/flight_sql_jdbc_driver.rst
@@ -162,7 +162,7 @@ the Flight SQL service as gRPC headers. For example, the following URI ::

This will connect without authentication or encryption, to a Flight
SQL service running on ``localhost`` on port 12345. Each request will
also include a `database=mydb` gRPC header.
also include a ``database=mydb`` gRPC header.

Connection parameters may also be supplied using the Properties object
when using the JDBC Driver Manager to connect. When supplying using