data.table spark/databases interface #1828

@ysgit

Description

data.table is awesome, but most people don't have 100 GB of memory with which to handle really large data sets in memory.

Big progress has been made in making the Apache Spark framework available through R over the last couple of years. Two such projects are Apache's SparkR and RStudio's sparklyr. Both provide a dplyr-style interface to Spark's data processing engine.

As a heavy data.table user, it would be amazing if there were a data.table interface for Spark. That would make it incredibly easy for data scientists to migrate their projects from smaller CSV-style data sets to the huge data sets that Spark can process.

A classic data pipeline for me is:

  1. Bring the data into R via CSV
  2. Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table
  3. Build a model using one of R's many machine learning packages
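The steps above can be sketched in a few lines of R. This is a minimal illustration with made-up column names (`id`, `group`, `value`): `fread()` parses a literal string here in place of a real CSV file, and base R's `lm()` stands in for whichever modelling package you prefer.

```r
library(data.table)

# 1. Bring the data into R via CSV
#    (fread() accepts a literal string as well as a file path)
dt <- fread("id,group,value\n1,a,10\n2,a,20\n3,b,30\n4,b,40")

# 2. Pre-process with data.table: filter rows, then aggregate by group
features <- dt[value > 10, .(mean_value = mean(value), n = .N), by = group]

# 3. Build a model (lm() as a stand-in for any of R's ML packages)
fit <- lm(mean_value ~ n, data = features)
```

The appeal of this workflow is that steps 1 and 2 are fast and concise, but everything must fit in local memory, which is exactly the limitation the issue describes.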

I want to be able to migrate this to

  1. Connect to data on a Hadoop cluster
  2. Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table's Spark interface
  3. Build a model using one of Spark's many machine learning algorithms.
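The migrated pipeline might look something like this. Note this is a purely hypothetical sketch: `spark_connect()` and `ml_linear_regression()` are real sparklyr functions, but `spark_data_table()` and the `[i, j, by]` call on a Spark-backed table are imagined, since they are exactly the interface this issue is requesting.

```r
library(sparklyr)
# library(data.table.spark)  # hypothetical package proposed by this issue

# 1. Connect to data on a Hadoop cluster (spark_connect is real sparklyr)
sc <- spark_connect(master = "yarn-client")

# 2. Pre-process with an imagined data.table interface over Spark.
#    spark_data_table() is hypothetical; the [i, j, by] syntax mirrors data.table,
#    but would be translated to Spark SQL operations under the hood.
dt <- spark_data_table(sc, "hdfs:///data/events.csv")
features <- dt[value > 10, .(mean_value = mean(value), n = .N), by = group]

# 3. Build a model with one of Spark's ML algorithms
#    (ml_linear_regression is real sparklyr)
fit <- ml_linear_regression(features, mean_value ~ n)
```

The key point is that steps 2 and 3 keep the familiar data.table syntax while the computation happens on the cluster, so migrating an existing script would mostly mean swapping `fread()` for a Spark-backed constructor.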
