Arrow: Difference between revisions
Revision as of 23:23, 6 April 2020
Apache Arrow is a cross-language development platform for in-memory data. It uses a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
CUDA
Arrow is also available with CUDA.
[name@server ~]$ module load gcc/8.3.0 arrow/0.16.0 cuda/10.1
Python bindings
The module contains bindings for multiple Python versions. To find out which Python versions are compatible, run
[name@server ~]$ module spider arrow/0.16.0
PyArrow
The Arrow Python bindings (also named PyArrow) have first-class integration with NumPy, Pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.
1. Load the required modules.
[name@server ~]$ module load gcc/8.3.0 arrow/0.16.0 python/3.7 scipy-stack
2. Import PyArrow.
[name@server ~]$ python -c "import pyarrow"
If the command displays nothing, the import was successful.
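The same check can be scripted, which may be convenient at the top of a job script. The following is only a minimal sketch: the helper name check_import is hypothetical and is not part of PyArrow, and it relies solely on the standard importlib module.

```python
import importlib


def check_import(module_name):
    """Return (True, version) if the module imports, else (False, error message)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return False, str(exc)
    # Not every module defines __version__, so fall back to a placeholder.
    return True, getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    # Assumes the modules from step 1 are loaded; otherwise FAILED is reported.
    for name in ("pyarrow", "pyarrow.parquet"):
        ok, detail = check_import(name)
        status = f"OK (version {detail})" if ok else f"FAILED ({detail})"
        print(f"{name}: {status}")
```

Unlike the one-line command, this prints an explicit status for each module instead of relying on the absence of output.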
For more information, see the Arrow Python documentation.
Apache Parquet format
The Parquet file format is supported through the pyarrow.parquet module.
To import the Parquet module, execute the previous steps for pyarrow, then run
[name@server ~]$ python -c "import pyarrow.parquet"
If the command displays nothing, the import was successful.
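Once the import succeeds, a quick round trip can confirm that Parquet files can actually be written and read. This is a minimal sketch, assuming the modules from the PyArrow steps are loaded; the function name parquet_roundtrip, the column names, and the file name example.parquet are made up for illustration.

```python
import os
import tempfile


def parquet_roundtrip():
    """Write a small table to a temporary Parquet file and read it back.

    Returns the original and the restored data as plain Python dicts.
    """
    # Imported inside the function so that defining it does not require pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build an in-memory Arrow table from built-in Python objects.
    original = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.parquet")  # hypothetical file name
        pq.write_table(original, path)   # write the table in Parquet format
        restored = pq.read_table(path)   # read it back as an Arrow table
        return original.to_pydict(), restored.to_pydict()
```

Calling parquet_roundtrip() should return two equal dictionaries if the Parquet support is working.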
R bindings
The Arrow package exposes an interface to the Arrow C++ library to access many of its features in R. This includes support for analyzing large, multi-file datasets (open_dataset()), working with individual Parquet files (read_parquet(), write_parquet()) and Feather files (read_feather(), write_feather()), as well as lower-level access to the Arrow memory and messages.
Installation
1. Load the required modules.
[name@server ~]$ module load gcc/8.3.0 arrow/0.16.0 r/3.6 boost/1.68.0
2. Specify the local installation directory.
[name@server ~]$ mkdir -p ~/.local/R/$EBVERSIONR/
[name@server ~]$ export R_LIBS=~/.local/R/$EBVERSIONR/
3. Export the required variables so that the build uses the Arrow installation provided by the module.
[name@server ~]$ export PKG_CONFIG_PATH=$EBROOTARROW/lib/pkgconfig
[name@server ~]$ export INCLUDE_DIR=$EBROOTARROW/include
[name@server ~]$ export LIB_DIR=$EBROOTARROW/lib
4. Install the bindings.
[name@server ~]$ R -e 'install.packages("arrow", repos="https://cloud.r-project.org/")'
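For scripted installations, the variables from steps 2 and 3 can also be derived programmatically. The sketch below is only an illustration: the helper name arrow_r_build_env is hypothetical, it assumes that EBROOTARROW and EBVERSIONR have been set by loading the modules in step 1, and it does not create the library directory (the mkdir in step 2 is still needed).

```python
import os


def arrow_r_build_env(environ):
    """Return the variables from steps 2 and 3, derived from the module variables."""
    arrow_root = environ["EBROOTARROW"]  # set by the arrow module
    r_version = environ["EBVERSIONR"]    # set by the r module
    # Same location as ~/.local/R/$EBVERSIONR/ in step 2.
    r_libs = os.path.join(os.path.expanduser("~"), ".local", "R", r_version)
    return {
        "R_LIBS": r_libs,
        "PKG_CONFIG_PATH": os.path.join(arrow_root, "lib", "pkgconfig"),
        "INCLUDE_DIR": os.path.join(arrow_root, "include"),
        "LIB_DIR": os.path.join(arrow_root, "lib"),
    }
```

One possible use is to merge the result into a copy of os.environ and pass it as the env argument of subprocess.run when launching the R command from step 4.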
Usage
After the bindings are installed, they have to be loaded.
1. Load the required modules.
[name@server ~]$ module load gcc/8.3.0 arrow/0.16.0 r/3.6
2. Load the library.
[name@server ~]$ R -e "library(arrow)"
> library("arrow")
Attaching package: ‘arrow’
For more information, see the Arrow R documentation.