Large Scale Machine Learning (Big Data)



<!--T:1-->
In the field of Deep Learning, the widespread use of mini-batching strategies along with first-order iterative solvers makes most common training tasks naturally scalable to large quantities of data. Whether you are training Deep Neural Networks on a few thousand examples or on hundreds of millions of them, the flow of your code looks largely the same: load a few examples from a target source (from disk, from memory, from a remote source...), iterate through them, compute gradients, and use them to update the parameters of the model as you go. Conversely, in many traditional machine learning packages, notably <code>scikit-learn</code>, scaling your code to train on very large datasets is often not trivial. Many algorithms that fit common models, such as Generalized Linear Models (GLMs) and Support Vector Machines (SVMs), have default implementations that require the entire training set to be loaded in memory and often do not leverage any form of thread or process parallelism. Some of these implementations also rely on memory-intensive solvers, which may require several times the size of your training set in memory to work properly.
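As a rough illustration of this flow, here is a minimal sketch of a mini-batch training loop written with NumPy. The linear model, random data and hyperparameters are placeholders for illustration; the point is that each update only needs the current mini-batch, so the same loop works whether the examples sit in an in-memory array or are streamed from disk.

<syntaxhighlight lang="python">
import numpy as np

# Minimal sketch of a mini-batch training loop for a linear model,
# fit by stochastic gradient descent on mean squared error.
# The data, model and hyperparameters are placeholders for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))          # could just as well be streamed from disk
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=10_000)

w = np.zeros(20)                           # model parameters
lr, batch_size = 0.01, 256

for epoch in range(5):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]            # the update below only needs this mini-batch
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)
        w -= lr * grad                     # parameter update from this batch alone
</syntaxhighlight>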


<!--T:2-->
This page covers options to scale out traditional machine learning methods to very large datasets. Whether your training workload is too massive to fit even in a large-memory node, or just big enough to take a very long time to process serially, the sections that follow may provide some insights to help you train models on Big Data.


=Scikit-learn= <!--T:3-->


<!--T:4-->
[https://scikit-learn.org/stable/index.html Scikit-learn] is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license. This popular package features an intuitive API that makes building fairly complex machine learning pipelines very straightforward. However, many of its implementations of common methods such as GLMs and SVMs assume that the entire training set can be loaded in memory, which might be a showstopper when dealing with massive datasets. Furthermore, some of these algorithms opt for memory-intensive solvers by default. In some cases, you can avoid these limitations using the ideas that follow.
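To make the limitation concrete, the sketch below shows the standard in-memory workflow. The random data are placeholders; the point is simply that <code>fit()</code> expects the whole training set as a single array and, with an estimator like <code>LogisticRegression</code>, offers no built-in way to feed the data in smaller chunks.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustration of the standard in-memory workflow: fit() takes the whole
# training set as one array. The data here are random placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))     # the entire design matrix must fit in RAM
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression()             # the default lbfgs solver works on the full dataset
clf.fit(X, y)                          # no way to pass the data one chunk at a time here
print(clf.score(X, y))
</syntaxhighlight>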


==Stochastic gradient solvers== <!--T:5-->


<!--T:6-->
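As an illustration of this approach, the sketch below uses <code>SGDClassifier</code> from <code>sklearn.linear_model</code>, whose <code>partial_fit()</code> method updates the model from one chunk of data at a time. The <code>data_chunks()</code> generator and the random data it yields are hypothetical stand-ins for however you actually read your dataset in pieces.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import SGDClassifier

# Sketch: train a linear classifier incrementally with partial_fit,
# processing the data one chunk at a time instead of loading it all at once.
# The random data and chunk size are placeholders for illustration.
rng = np.random.default_rng(0)

def data_chunks(n_chunks=100, chunk_size=10_000, n_features=50):
    """Hypothetical stand-in for reading successive chunks from disk or a remote source."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X[:, 0] > 0).astype(int)
        yield X, y

clf = SGDClassifier()                          # linear SVM fit by SGD (default hinge loss)
classes = np.array([0, 1])                     # partial_fit needs the full set of classes up front

for X_chunk, y_chunk in data_chunks():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
</syntaxhighlight>

Several other scikit-learn estimators expose the same <code>partial_fit()</code> interface, which is the hook that makes this kind of out-of-core training possible.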