data.table is awesome, but most people don't have 100 GB of memory to handle really large data sets in memory.
Big progress has been made over the last couple of years in making the Apache Spark framework available through R. Two such projects are Apache's SparkR and RStudio's sparklyr; both provide a dplyr-style interface to Spark's data-processing engine.
As a heavy data.table user, it would be amazing if there were a data.table interface for Spark. That would make it incredibly easy for data scientists to migrate their projects from smaller CSV-sized data sets to the huge data sets that Spark can process.
A classic data pipeline for me is:
- Read the data into R from CSV
- Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table
- Build a model using one of R's many machine learning packages
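The pipeline above can be sketched in a few lines of data.table. This is a minimal illustration only: the column names (`customer_id`, `amount`) and the toy data standing in for a `fread()` call are invented for the example.

```r
library(data.table)

# Toy data standing in for a CSV read via fread("sales.csv")
# (file name and columns are illustrative)
set.seed(1)
dt <- data.table(customer_id = rep(1:3, each = 4),
                 amount      = runif(12, 0, 100))

# Pre-processing: filter, then aggregate per customer
agg <- dt[amount > 0, .(total = sum(amount)), by = customer_id]

# Fit a simple model with base R (any of R's ML packages works here)
fit <- lm(total ~ customer_id, data = agg)
```

The appeal of this style is that the filter, grouping, and aggregation all happen inside one `[` call on the data.table.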
I want to be able to migrate this to:
- Connect to data on a Hadoop cluster
- Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table's spark interface
- Build a model using one of Spark's many machine learning algorithms.
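For comparison, the same pipeline is already expressible today in sparklyr's dplyr-style interface (the request here is for a data.table-style equivalent). A rough sketch, where the master URL, HDFS path, and column names are all placeholders for illustration:

```r
library(sparklyr)
library(dplyr)

# Connect to the cluster (master and path are illustrative)
sc <- spark_connect(master = "yarn-client")
sales <- spark_read_csv(sc, "sales", "hdfs:///data/sales.csv")

# Pre-processing executed by Spark, expressed in dplyr verbs
agg <- sales %>%
  filter(amount > 0) %>%
  group_by(customer_id) %>%
  summarise(total = sum(amount))

# Fit a model with Spark MLlib via sparklyr
fit <- ml_linear_regression(agg, total ~ customer_id)
```

A data.table interface would replace the `%>%` chain with the `dt[i, j, by]` idiom while still pushing the computation down to Spark.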