data.table spark/databases interface #1828

@ysgit

Description

data.table is awesome, but most people don't have 100 GB of memory with which to handle really large data sets in memory.

Big progress has been made in making the Apache Spark framework available through R over the last couple of years. Two such projects are Apache's SparkR and RStudio's sparklyr. Both provide a dplyr-style interface to Spark's data processing engine.

As a heavy data.table user, it would be amazing if there were a data.table interface for Spark. That would make it incredibly easy for data scientists to migrate their projects from smaller CSV-style data sets to the huge data sets that Spark can process.

A classic data pipeline for me is:

  1. Bring the data into R via CSV
  2. Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table
  3. Build a model using one of R's many machine learning packages
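The steps above can be sketched in a few lines of R. This is a minimal illustration with made-up column names (`id`, `group`, `value`): `fread()` parses a literal string here in place of a real CSV file, and base R's `lm()` stands in for whichever modelling package you prefer.

```r
library(data.table)

# 1. Bring the data into R via CSV
#    (fread() accepts a literal string as well as a file path)
dt <- fread("id,group,value\n1,a,10\n2,a,20\n3,b,30\n4,b,40")

# 2. Pre-process with data.table: filter rows, then aggregate by group
features <- dt[value > 10, .(mean_value = mean(value), n = .N), by = group]

# 3. Build a model (lm() as a stand-in for any of R's ML packages)
fit <- lm(mean_value ~ n, data = features)
```

The appeal of this workflow is that steps 1 and 2 are fast and concise, but everything must fit in local memory, which is exactly the limitation the issue describes.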

I want to be able to migrate this to

  1. Connect to data on a Hadoop cluster
  2. Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table's Spark interface
  3. Build a model using one of Spark's many machine learning algorithms.
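The migrated pipeline might look something like this. Note this is a purely hypothetical sketch: `spark_connect()` and `ml_linear_regression()` are real sparklyr functions, but `spark_data_table()` and the `[i, j, by]` call on a Spark-backed table are imagined, since they are exactly the interface this issue is requesting.

```r
library(sparklyr)
# library(data.table.spark)  # hypothetical package proposed by this issue

# 1. Connect to data on a Hadoop cluster (spark_connect is real sparklyr)
sc <- spark_connect(master = "yarn-client")

# 2. Pre-process with an imagined data.table interface over Spark.
#    spark_data_table() is hypothetical; the [i, j, by] syntax mirrors data.table,
#    but would be translated to Spark SQL operations under the hood.
dt <- spark_data_table(sc, "hdfs:///data/events.csv")
features <- dt[value > 10, .(mean_value = mean(value), n = .N), by = group]

# 3. Build a model with one of Spark's ML algorithms
#    (ml_linear_regression is real sparklyr)
fit <- ml_linear_regression(features, mean_value ~ n)
```

The key point is that steps 2 and 3 keep the familiar data.table syntax while the computation happens on the cluster, so migrating an existing script would mostly mean swapping `fread()` for a Spark-backed constructor.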
