Arrow: Difference between revisions

From Alliance Doc
(Marked this version for translation)
 
(22 intermediate revisions by 4 users not shown)
Line 1: Line 1:
[https://arrow.apache.org/ Apache Arrow] is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
<languages />
[[Category:Software]]


== CUDA ==
<translate>
<!--T:1-->
[https://arrow.apache.org/ Apache Arrow] is a cross-language development platform for in-memory data. It uses a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
 
== CUDA == <!--T:2-->
Arrow is also available with CUDA.
Arrow is also available with CUDA.
{{Command|module load gcc/8.3.0 arrow/0.16.0 cuda/10.1}}
{{Command|module load gcc arrow/X.Y.Z cuda}}
where X.Y.Z represent the desired version.


== Python bindings ==
== Python bindings == <!--T:3-->
The module contains bindings for multiple python versions.  
The module contains bindings for multiple Python versions.  
To discover which are the compatible Python versions:
To discover which are the compatible Python versions, run
{{Command|module spider arrow/0.16.0}}
{{Command|module spider arrow/X.Y.Z}}
where <tt>X.Y.Z</tt> represent the desired version.


=== PyArrow ===
<!--T:22-->
The Arrow Python bindings (also named ''PyArrow'') have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.
Or search directly ''pyarrow'', by running
{{Command|module spider pyarrow}}


1. Load the required modules:
=== PyArrow === <!--T:4-->
{{Command|module load gcc/8.3.0 arrow/0.16.0 python/3.7 scipy-stack}}
The Arrow Python bindings (also named ''PyArrow'') have first-class integration with NumPy, Pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.


2. Import PyArrow
<!--T:5-->
1. Load the required modules.
{{Command|module load gcc arrow/X.Y.Z python/3.11}}
where <tt>X.Y.Z</tt> represent the desired version.
 
<!--T:6-->
2. Import PyArrow.
{{Command|python -c "import pyarrow"}}
{{Command|python -c "import pyarrow"}}


The command display nothing. You have successfully imported PyArrow.
<!--T:7-->
If the command displays nothing, the import was successful.


<!--T:8-->
For more information, see the [https://arrow.apache.org/docs/python/ Arrow Python] documentation.
For more information, see the [https://arrow.apache.org/docs/python/ Arrow Python] documentation.


==== Apache Parquet Format ====
==== Fulfilling other Python package dependency ==== <!--T:21-->
Other Python packages depends on PyArrow in order to be installed.
With the <code>arrow</code> module loaded, your package dependency for <code>pyarrow</code> will be satisfied.
{{Command
|pip list {{!}} grep pyarrow
|result=
pyarrow    17.0.0
}}
 
==== Apache Parquet format ==== <!--T:9-->
The [http://parquet.apache.org/ Parquet] file format is available.  
The [http://parquet.apache.org/ Parquet] file format is available.  


To import it, execute previous steps for <tt>pyarrow</tt>, then :
<!--T:10-->
To import the Parquet module, execute the previous steps for <code>pyarrow</code>, then run
{{Command|python -c "import pyarrow.parquet"}}
{{Command|python -c "import pyarrow.parquet"}}


The command display nothing. You have successfully imported the Parquet module.
<!--T:11-->
If the command displays nothing, the import was successful.


== R bindings ==
== R bindings == <!--T:12-->
The arrow package exposes an interface to the Arrow C++ library to access many of its features in R. This includes support for analyzing large, multi-file datasets ([https://arrow.apache.org/docs/r/reference/open_dataset.html open_dataset()]), working with individual Parquet ([https://arrow.apache.org/docs/r/reference/read_parquet.html read_parquet()], [https://arrow.apache.org/docs/r/reference/write_parquet.html write_parquet()]) and Feather ([https://arrow.apache.org/docs/r/reference/read_feather.html read_feather()], [https://arrow.apache.org/docs/r/reference/write_feather.html write_feather()]) files, as well as lower-level access to Arrow memory and messages.
The Arrow package exposes an interface to the Arrow C++ library to access many of its features in R. This includes support for analyzing large, multi-file datasets ([https://arrow.apache.org/docs/r/reference/open_dataset.html open_dataset()]), working with individual Parquet files ([https://arrow.apache.org/docs/r/reference/read_parquet.html read_parquet()], [https://arrow.apache.org/docs/r/reference/write_parquet.html write_parquet()]) and Feather files ([https://arrow.apache.org/docs/r/reference/read_feather.html read_feather()], [https://arrow.apache.org/docs/r/reference/write_feather.html write_feather()]), as well as lower-level access to the Arrow memory and messages.


=== Installation ===
=== Installation === <!--T:13-->
1. Load the required modules:
1. Load the required modules.
{{Command|module load gcc/8.3.0 arrow/0.16.0 r/3.6 boost/1.68.0}}
{{Command|module load StdEnv/2020 gcc/9.3.0 arrow/8 r/4.1 boost/1.72.0}}


2. Specify the local installation directory:
<!--T:14-->
2. Specify the local installation directory.
{{Commands
{{Commands
|mkdir -p ~/.local/R/$EBVERSIONR/
|mkdir -p ~/.local/R/$EBVERSIONR/
Line 44: Line 72:
}}
}}


3. Export the required variables to ensure we are using the system installation:
<!--T:15-->
3. Export the required variables to ensure you are using the system installation.
{{Commands
{{Commands
|export PKG_CONFIG_PATH{{=}}$EBROOTARROW/lib/pkgconfig
|export PKG_CONFIG_PATH{{=}}$EBROOTARROW/lib/pkgconfig
Line 51: Line 80:
}}
}}


4. Install the bindings
<!--T:16-->
4. Install the bindings.
{{Command|R -e 'install.packages("arrow", repos{{=}}"https://cloud.r-project.org/")'}}
{{Command|R -e 'install.packages("arrow", repos{{=}}"https://cloud.r-project.org/")'}}


=== Usage ===
=== Usage === <!--T:17-->
Once installed, you can load the bindings.
After the bindings are installed, they have to be loaded.


1. Load the required modules:
<!--T:18-->
{{Command|module load gcc/8.3.0 arrow/0.16.0 r/3.6}}
1. Load the required modules.
{{Command|module load StdEnv/2020 gcc/9.3.0 arrow/8 r/4.1}}


2. Load the library:
<!--T:19-->
2. Load the library.
{{Command
{{Command
|R -e "library(arrow)"
|R -e "library(arrow)"
Line 68: Line 100:
}}
}}


For more information on its usage, see [https://arrow.apache.org/docs/r/index.html Arrow R documentation]
<!--T:20-->
For more information, see the [https://arrow.apache.org/docs/r/index.html Arrow R documentation]
</translate>

Latest revision as of 15:12, 16 July 2024


Apache Arrow is a cross-language development platform for in-memory data. It uses a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

CUDA

Arrow is also available with CUDA.

[name@server ~]$ module load gcc arrow/X.Y.Z cuda

where X.Y.Z represents the desired version.

Python bindings

The module contains bindings for multiple Python versions. To find out which Python versions are compatible, run

[name@server ~]$ module spider arrow/X.Y.Z

where X.Y.Z represents the desired version.

Alternatively, search for pyarrow directly by running

[name@server ~]$ module spider pyarrow

PyArrow

The Arrow Python bindings (also named PyArrow) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.

1. Load the required modules.

[name@server ~]$ module load gcc arrow/X.Y.Z python/3.11

where X.Y.Z represents the desired version.

2. Import PyArrow.

[name@server ~]$ python -c "import pyarrow"

If the command displays nothing, the import was successful.

For more information, see the Arrow Python documentation.

Fulfilling other Python package dependencies

Some other Python packages depend on PyArrow and cannot be installed without it. With the arrow module loaded, the pyarrow dependency of such packages is satisfied.

[name@server ~]$ pip list | grep pyarrow
pyarrow    17.0.0
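
The same check can be made from inside Python, which may be convenient in a job script. This is a sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of any Alliance tooling.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not visible."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("pyarrow")
if found is None:
    print("pyarrow not found (is the arrow module loaded?)")
else:
    print("pyarrow", found)
```

A result of None suggests that the arrow module is not loaded in the current session, or that it does not provide bindings for the loaded Python version.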

Apache Parquet format

The Parquet file format is available.

To import the Parquet module, execute the previous steps for pyarrow, then run

[name@server ~]$ python -c "import pyarrow.parquet"

If the command displays nothing, the import was successful.

R bindings

The Arrow package exposes an interface to the Arrow C++ library to access many of its features in R. This includes support for analyzing large, multi-file datasets (open_dataset()), working with individual Parquet files (read_parquet(), write_parquet()) and Feather files (read_feather(), write_feather()), as well as lower-level access to the Arrow memory and messages.

Installation

1. Load the required modules.

[name@server ~]$ module load StdEnv/2020 gcc/9.3.0 arrow/8 r/4.1 boost/1.72.0

2. Specify the local installation directory.

[name@server ~]$ mkdir -p ~/.local/R/$EBVERSIONR/
[name@server ~]$ export R_LIBS=~/.local/R/$EBVERSIONR/


3. Export the required variables to ensure you are using the system installation.

[name@server ~]$ export PKG_CONFIG_PATH=$EBROOTARROW/lib/pkgconfig
[name@server ~]$ export INCLUDE_DIR=$EBROOTARROW/include
[name@server ~]$ export LIB_DIR=$EBROOTARROW/lib


4. Install the bindings.

[name@server ~]$ R -e 'install.packages("arrow", repos="https://cloud.r-project.org/")'

Usage

After the bindings are installed, they have to be loaded.

1. Load the required modules.

[name@server ~]$ module load StdEnv/2020 gcc/9.3.0 arrow/8 r/4.1

2. Load the library.

[name@server ~]$ R -e "library(arrow)"
> library("arrow")
Attaching package: ‘arrow’

For more information, see the Arrow R documentation.