ARROW-3731: MVP to read parquet in R library
This PR addresses [ARROW-3731](https://issues.apache.org/jira/browse/ARROW-3731). It adds the minimum functionality needed to read a Parquet file into an `arrow::Table`, which can then be converted to a tibble. Multiple Parquet files can be read with `lapply` and concatenated afterwards.

Steps to compile:
1) Build the Arrow and Parquet C++ projects
2) In R, run `devtools::load_all()`

What I could use help with:
The biggest challenge for me is my lack of experience with pkg-config. The R library has a `configure` file that uses pkg-config to work out which C++ libraries to link against. Currently, `configure` looks up the Arrow project and links to `-larrow` only. We need it to link to `-lparquet` as well. I do not know how to modify pkg-config's metadata so that it links to both `-larrow` and `-lparquet`.
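
A possible approach to the linking question, sketched under assumptions: pkg-config accepts several module names in one call and merges their flags, so querying `arrow parquet` together should yield both `-larrow` and `-lparquet` without editing any `.pc` metadata. This is illustrative only and is not the package's actual `configure` logic. `resolve_libs` is a hypothetical helper, and the sketch assumes the Arrow and Parquet C++ builds installed `arrow.pc` and `parquet.pc` somewhere on the pkg-config search path.

```shell
#!/bin/sh
# Hypothetical helper (not part of r/configure): print linker flags for a
# space-separated list of pkg-config modules, or a hard-coded fallback when
# pkg-config is missing or any of the modules cannot be found.
resolve_libs() {
  modules="$1"    # e.g. "arrow parquet" (assumed module names)
  fallback="$2"   # e.g. "-larrow -lparquet"
  # $modules is deliberately unquoted so each name becomes its own argument.
  if pkg-config --exists $modules 2>/dev/null; then
    pkg-config --libs $modules
  else
    echo "$fallback"
  fi
}

PKG_LIBS=$(resolve_libs "arrow parquet" "-larrow -lparquet")
echo "PKG_LIBS=$PKG_LIBS"
```

The fallback keeps the build usable on machines without pkg-config, at the cost of assuming the libraries sit on the default linker path.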

Author: Jeffrey Wong <[email protected]>
Author: Romain Francois <[email protected]>
Author: jeffwong-nflx <[email protected]>

Closes apache#3230 from jeffwong-nflx/master and squashes the following commits:

c67fa3d <jeffwong-nflx> Merge pull request #3 from jeffwong-nflx/cleanup
1df3026 <Jeffrey Wong> don't hard code -larrow and -lparquet
8ccaa51 <Jeffrey Wong> cleanup
75ba5c9 <Jeffrey Wong> add contributor
56adad2 <jeffwong-nflx> Merge pull request #2 from romainfrancois/3731/parquet-2
7d6e64d <Romain Francois> read_parquet() only reading one parquet file, and gains a `as_tibble` argument
e936b44 <Romain Francois> need parquet on travis too
ff260c5 <Romain Francois> header was too commented, renamed to parquet.cpp
9e1897f <Romain Francois> styling etc ...
456c5d2 <Jeffrey Wong> read parquet files
22d89dd <Jeffrey Wong> hardcode -larrow and -lparquet
jeffwong-nflx authored and wesm committed Jan 5, 2019
1 parent 66f0d39 commit 5723ada
Showing 11 changed files with 131 additions and 48 deletions.
2 changes: 2 additions & 0 deletions .travis.yml
@@ -326,6 +326,8 @@ matrix:
language: r
cache: packages
latex: false
env:
- ARROW_TRAVIS_PARQUET=1
before_install:
# Have to copy-paste this here because of how R's build steps work
- eval `python $TRAVIS_BUILD_DIR/ci/detect-changes.py`
2 changes: 2 additions & 0 deletions r/DESCRIPTION
@@ -4,6 +4,7 @@ Version: 0.11.0.9000
Authors@R: c(
person("Romain", "François", email = "[email protected]", role = c("aut", "cre")),
person("Javier", "Luraschi", email = "[email protected]", role = c("ctb")),
person("Jeffrey", "Wong", email = "[email protected]", role = c("ctb")),
person("Apache Arrow", email = "[email protected]", role = c("aut", "cph"))
)
Description: R Integration to 'Apache' 'Arrow'.
@@ -62,6 +63,7 @@ Collate:
'memory_pool.R'
'message.R'
'on_exit.R'
'parquet.R'
'read_record_batch.R'
'read_table.R'
'reexports-bit64.R'
1 change: 1 addition & 0 deletions r/NAMESPACE
@@ -123,6 +123,7 @@ export(read_arrow)
export(read_csv_arrow)
export(read_feather)
export(read_message)
export(read_parquet)
export(read_record_batch)
export(read_schema)
export(read_table)
4 changes: 4 additions & 0 deletions r/R/RcppExports.R

Some generated files are not rendered by default.

33 changes: 33 additions & 0 deletions r/R/parquet.R
@@ -0,0 +1,33 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

#' Read a Parquet file from disk
#'
#' @param file a file path
#' @param as_tibble should the [arrow::Table][arrow__Table] be converted to a tibble?
#' @param ... currently ignored
#'
#' @return an [arrow::Table][arrow__Table], or a data frame if `as_tibble` is `TRUE`.
#'
#' @export
read_parquet <- function(file, as_tibble = TRUE, ...) {
  tab <- shared_ptr(`arrow::Table`, read_parquet_file(file))
  if (isTRUE(as_tibble)) {
    tab <- as_tibble(tab)
  }
  tab
}
2 changes: 1 addition & 1 deletion r/README.Rmd
@@ -25,7 +25,7 @@ git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release

# It is important to statically link to boost libraries
-cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
+cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install
```

61 changes: 16 additions & 45 deletions r/README.md
@@ -14,7 +14,7 @@ git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release

# It is important to statically link to boost libraries
-cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
+cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install
```

@@ -38,48 +38,19 @@ tf <- tempfile()
#> # A tibble: 10 x 2
#> x y
#> <int> <dbl>
-#> 1 1 -0.255
-#> 2 2 -0.162
-#> 3 3 -0.614
-#> 4 4 -0.322
-#> 5 5 0.0693
-#> 6 6 -0.920
-#> 7 7 -1.08
-#> 8 8 0.658
-#> 9 9 0.821
-#> 10 10 0.539
-arrow::write_arrow(tib, tf)
-
-# read it back with pyarrow
-pa <- import("pyarrow")
-as_tibble(pa$open_file(tf)$read_pandas())
-#> # A tibble: 10 x 2
-#> x y
-#> <int> <dbl>
-#> 1 1 -0.255
-#> 2 2 -0.162
-#> 3 3 -0.614
-#> 4 4 -0.322
-#> 5 5 0.0693
-#> 6 6 -0.920
-#> 7 7 -1.08
-#> 8 8 0.658
-#> 9 9 0.821
-#> 10 10 0.539
-```
-
-## Development
-
-### Code style
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-```
-./lint.sh
-```
-
-You can fix the style issues with
-
+#> 1 1 0.0855
+#> 2 2 -1.68
+#> 3 3 -0.0294
+#> 4 4 -0.124
+#> 5 5 0.0675
+#> 6 6 1.64
+#> 7 7 1.54
+#> 8 8 -0.0209
+#> 9 9 -0.982
+#> 10 10 0.349
+# arrow::write_arrow(tib, tf)
+
+# # read it back with pyarrow
+# pa <- import("pyarrow")
+# as_tibble(pa$open_file(tf)$read_pandas())
```
-./lint.sh --fix
-```
4 changes: 2 additions & 2 deletions r/configure
@@ -26,13 +26,13 @@
# R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'

# Library settings
-PKG_CONFIG_NAME="arrow"
+PKG_CONFIG_NAME="arrow parquet"
PKG_DEB_NAME="arrow"
PKG_RPM_NAME="arrow"
PKG_CSW_NAME="arrow"
PKG_BREW_NAME="apache-arrow"
PKG_TEST_HEADER="<arrow/api.h>"
-PKG_LIBS="-larrow"
+PKG_LIBS="-larrow -lparquet"

# Use pkg-config if available
pkg-config --version >/dev/null 2>&1
21 changes: 21 additions & 0 deletions r/man/read_parquet.Rd

Some generated files are not rendered by default.

12 changes: 12 additions & 0 deletions r/src/RcppExports.cpp

Some generated files are not rendered by default.

37 changes: 37 additions & 0 deletions r/src/parquet.cpp
@@ -0,0 +1,37 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>
#include <parquet/exception.h>

// [[Rcpp::export]]
std::shared_ptr<arrow::Table> read_parquet_file(std::string filename) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_THROW_NOT_OK(
      arrow::io::ReadableFile::Open(filename, arrow::default_memory_pool(), &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

  return table;
}
