Arrow: Difference between revisions

Revision as of 18:02, 26 March 2020

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

CUDA

Arrow is also available with CUDA support. To use it, load the CUDA module together with Arrow:

[name@server ~]$ module load gcc/8.3.0 arrow/0.16.0 cuda/10.1

Python bindings

The module contains bindings for multiple Python versions. To find out which Python versions are compatible, run:

[name@server ~]$ module spider arrow/0.16.0

PyArrow

The Arrow Python bindings (also named PyArrow) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.

1. Load the required modules:

[name@server ~]$ module load gcc/8.3.0 arrow/0.16.0 python/3.7 scipy-stack

2. Import PyArrow:

[name@server ~]$ python -c "import pyarrow"

If the command displays nothing, you have successfully imported PyArrow.
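
Beyond a bare import, a short round trip through built-in Python objects gives a slightly stronger check that the bindings work. The following is a minimal sketch, assuming the modules above are loaded; the helper name `check_pyarrow` is hypothetical, and it returns `None` rather than failing when PyArrow cannot be imported.

```python
def check_pyarrow():
    """Return (version, values) after a list round trip, or None if PyArrow is missing."""
    try:
        import pyarrow as pa
    except ImportError:
        # PyArrow is not on the Python path (for example, modules not loaded).
        return None

    # Build an Arrow array from a Python list, then convert it back.
    arrow_array = pa.array([1, 2, 3])
    return pa.__version__, arrow_array.to_pylist()


result = check_pyarrow()
if result is None:
    print("PyArrow not found; load the arrow and python modules first.")
else:
    print("PyArrow", result[0], "round trip:", result[1])
```

If the bindings are available, the round trip should give back the original list unchanged.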

For more information, see the Arrow Python documentation (https://arrow.apache.org/docs/python/).

Apache Parquet Format

The Parquet file format (http://parquet.apache.org/) is available through PyArrow.

To import it, follow the previous steps for pyarrow, then run:

[name@server ~]$ python -c "import pyarrow.parquet"

If the command displays nothing, you have successfully imported the Parquet module.
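
To go one step further than the import, the sketch below writes a small table to a temporary Parquet file and reads it back. It assumes the PyArrow modules above are loaded; the function name `parquet_round_trip`, the column names, and the file name are illustrative only.

```python
import os
import tempfile


def parquet_round_trip():
    """Write a small table to a temporary Parquet file and read it back."""
    try:
        import pyarrow as pa
        import pyarrow.parquet as pq
    except ImportError:
        # PyArrow (or its Parquet support) is not available in this environment.
        return None

    table = pa.Table.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.parquet")
        pq.write_table(table, path)   # serialize the table to Parquet
        restored = pq.read_table(path)  # load it back as an Arrow table
        return restored.to_pydict()


data = parquet_round_trip()
print("Parquet support not found." if data is None else data)
```

The temporary directory is removed when the function returns, so nothing is left behind in your home or scratch space.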

R bindings

The arrow package exposes an interface to the Arrow C++ library to access many of its features in R. This includes support for analyzing large, multi-file datasets (open_dataset()), working with individual Parquet (read_parquet(), write_parquet()) and Feather (read_feather(), write_feather()) files, as well as lower-level access to Arrow memory and messages.

Installation

1. Load the required modules:

[name@server ~]$ module load gcc/8.3.0 arrow/0.16.0 r/3.6 boost/1.68.0

2. Specify the local installation directory:

[name@server ~]$ mkdir -p ~/.local/R/$EBVERSIONR/
[name@server ~]$ export R_LIBS=~/.local/R/$EBVERSIONR/

3. Export the required variables to ensure that the system installation of Arrow is used:

[name@server ~]$ export PKG_CONFIG_PATH=$EBROOTARROW/lib/pkgconfig
[name@server ~]$ export INCLUDE_DIR=$EBROOTARROW/include
[name@server ~]$ export LIB_DIR=$EBROOTARROW/lib

4. Install the bindings:

[name@server ~]$ R -e 'install.packages("arrow", repos="https://cloud.r-project.org/")'

Usage

Once installed, you can load the bindings.

1. Load the required modules:

[name@server ~]$ module load gcc/8.3.0 arrow/0.16.0 r/3.6

2. Load the library:

[name@server ~]$ R -e "library(arrow)"
> library("arrow")
Attaching package: ‘arrow’

For more information on usage, see the Arrow R documentation (https://arrow.apache.org/docs/r/index.html).

