Import Spark Runner code by tomwhite · Pull Request #37 · apache/beam

tomwhite · 2016-03-10T11:30:09Z

This addresses https://issues.apache.org/jira/browse/BEAM-6.

I've preserved git history (using @mxm's amazing git rewriting trick from #12). This is just an initial import - the Spark runner build is not yet integrated with the main build, packages need changing, etc. That's going to take more work, so it might be a good idea to get this merged first.

FileOutputFormats to be used with Spark Dataflow, as long as they implement the ShardNameTemplateAware interface. This is easily achieved by subclassing the desired FileOutputFormat class, see TemplatedSequenceFileOutputFormat for an example.

Add support for application name and streaming (default: false) Add pipeline options for streaming Add print output as an unbounded write Add default window strategy to represent Spark streaming micro-batches as fixed windows This translator helps to translate Dataflow transformations into Spark (+streaming) transformations. This will help to support streaming transformations separately Expose through the SparkPipelineTranslator Now Evaluator uses SparkPipelineTranslator to translate Add default application name StreamingEvaluation context to support DStream evaluation. Expose members and methods in EvaluationContext for inheritors Use configured app name in options A TransformTranslator for streaming Add support for spark streaming execution in the runner Fix comment Create input stream from a queue - mainly for testing I guess Add support to create input stream from queued values Override method to expose in package Test WordCount in streaming, just print out for now.. Stream print to console is a transformation of PCollection to PDone rename to CreateStream to differ from Dataflow Create It seems that in 1.3.1 short living streaming jobs fail (like unit tests). Maybe has something to do with SPARK-7930. fixed in 1.4.0 so bumped up. Expose some methods, add a method to check if RDDHolder exists make context final Streaming default should be local[1] to support unit tests No need for recurring context. Exposing additional parent methods. Added RUNNING state when stream is running. WordCount test runs 1 (sec) interval and compares to expected like in batch.
Void Create triggers a no-input transformation transformations and output operations can be applied on streams/bounded collections in the pipeline foreachRDD is used for PDone transformation Comments SocketIO to consume stream from socket Comment Add support for Kafka input Comments and some patching-up Default is the same as in SparkPipelineOptions Adding licenses To satisfy license Javadoc and codestyle Satisfy license Javadoc and codestyle Check for DataflowAssertFailure because it won't propagate Since DataflowAssert doesn't propagate failures in streaming, use Aggregators to assert Use DataflowAssertStreaming Add kafka translation Embedded Kafka for unit test Kafka unit test import order license WindowingHelpers by Tom White @tomwhite Combine @tomwhite windowing branch into mine - values are windowed values now values are windowed values now Input is UNBOUNDED now Using windowing instead batchInterval to be determined by pipeline runner print the value not the windowed value remove support for optimizations. for now. batchInterval is determined by the pipeline runner now Add streaming window pipeline visitor to determine windowing Add windowing support in streaming unit tests Combine.Globally is necessary so leave it fix line length renames Add implementation for GroupAlsoByWindow which helps to solve broken grouped/combinePerKey Line indentation unused codestyle Expose runtimeContext Make public Use the smallest window found (fixed/sliding) as the batch duration Make FieldGetter public Add support for windowing codestyle unused Update Spark to 1.5, kafka dependency should be provided Abstract Evaluator for common evaluator code. doVisitTransform per implementation. Added non-streaming windowing test by Tom White @tomwhite Fixed Combine.GroupedValues and Combine.Globally to work with WindowedValues without losing window properties. For now, Combine.PerKey is commented out until fixed to fully support WindowedValues.
Support WindowedValues, Global or not, in Combine.PerKey After changes made to Combine.PerKey in 3a46150 it seems that the order has changed. Since order didn't seem relevant before the change, I don't see a reason not to change the expected value accordingly. Update Spark version to 1.5.2
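
The commit notes above mention using the smallest window found (fixed/sliding) as the batch duration. As a rough illustration only, and not the runner's actual code, the selection could look something like the sketch below. The class name `BatchDurationSketch`, the method `smallestWindow`, and the idea that window durations have already been collected into a plain collection are all assumptions made for this example.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

// Hypothetical sketch: not part of the Spark runner.
public class BatchDurationSketch {

    /**
     * Returns the smallest of the given window durations, which a streaming
     * runner could then use as its micro-batch interval.
     * Assumes the durations were already extracted from the pipeline's windows.
     */
    public static Duration smallestWindow(Collection<Duration> windowDurations) {
        if (windowDurations.isEmpty()) {
            throw new IllegalArgumentException("no window durations supplied");
        }
        // Duration is Comparable, so Collections.min picks the shortest one.
        return Collections.min(windowDurations);
    }

    public static void main(String[] args) {
        Duration batch = smallestWindow(Arrays.asList(
                Duration.ofSeconds(10), Duration.ofSeconds(1), Duration.ofMinutes(1)));
        System.out.println(batch.toMillis());
    }
}
```

Choosing the minimum means every declared window spans at least one whole micro-batch, which is presumably why the commit picks it.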

davorbonaci · 2016-03-10T17:14:46Z

R: @davorbonaci

I'll take a quick peek.

amitsela · 2016-03-10T17:22:53Z

@davorbonaci feel free to merge this. I'll take care of integrating per https://issues.apache.org/jira/browse/BEAM-11

davorbonaci · 2016-03-10T17:30:25Z

Nice!

I'd probably get rid of LICENCE and CONTRIBUTING.md right away, and prefix the pull request with [BEAM-6] Import....

I can merge this right away -- no issues there. Just to confirm -- both of you should have commit/write access to the project. Is that not the case?

amitsela · 2016-03-10T17:40:23Z

Supposedly - but you're right, it'll be a good idea to test that.. Let me do the honors ;)

davorbonaci · 2016-03-10T20:59:59Z

Awesome. Thanks @amitsela and @tomwhite.

mxm · 2016-03-11T09:32:49Z

Glad the snippet could help you out @tomwhite. Nice to see this going in!

* feat: make finish partition action idempotent
* feat: make child partition action idempotent
* fix: fix insert query in partition metadata dao
* chore: spotless apply
* refactor: catch error on child partition action
  Rely on already exists error to skip inserting a partition in the child partition record instead of checking if the key exists. We reduce the number of calls by doing this.
* refactor: catch error on finish partition action
  Rely on catching an exception with a specific code to make the finish partition action idempotent.
* refactor: removes unused dao method

Josh Wills and others added 30 commits March 10, 2016 11:14

Initial commit

ad44447

Dumbest proof of concept possible

845a817

First bit of work to get this running against the new Dataflow API

cb7c866

Update version of dataflow to get new API method access

dce03e4

Add support for getters and a Flatten impl

08e94b2

Such code. Much features.

9fdac6c

Adding some more operators: toiterable, seqdo

deca2c0

Support for ParDo.BoundMulti

6ee38b2

Fix bug in deserializing side inputs

64c6d8d

Add SparkRuntimeContext for handling shared runtime objects

40adbec

First cut at aggregators

565509d

First minimally working aggregators

bb219d4

Updates for 141206 SDK release

9152769

Dummy impls of windowing-related ProcContext functions

3bd04ae

Simplify pom.xml

ba74f19

Add proper coder handling to RDD retrieval

6aa08e0

Refactor aggregation related classes.

45be508

Add README.md and update project version in pom.xml.

67cf364

Adds Javadoc and Tests to project.

137d54a

Add apache2 license and cloudera copyright.

b954589

Adds custom checkstyle.

1523ffd

more checkstyle improvements

Factor out transform translation logic into its own class.

2992838

Specify and rationalize generic types in State, CoderHelpers to start

2e3fe1a

Add simple word count test.

7489263

Factor out spark pipeline options.

f9e8fab

Miscellaneous inspection changes from IntelliJ

ec172ba

Issue apache#13 : attempt to remove all generics warnings, or handle …

ed1e2f7

…them explicitly

Update and specify POM plugin config; Update Spark to 1.1.1, JUnit to…

225f6c0

… 4.12, Spark 1.2; Add source, javadoc plugins and other info; Fix javadoc errors and a few typos

Improve readme to explain current state of the repo, and to encourage…

1f9cd04

… outside contributions.

Update version of dataflow we depend on.

ba4b326

The primary change needed to accommodate the new dataflow api is to how we handle side inputs.

tomwhite and others added 14 commits March 10, 2016 11:15

Add NullWritableCoder and test.

8762b26

[maven-release-plugin] prepare release spark-dataflow-0.4.2

b8949b8

[maven-release-plugin] prepare for next development iteration

ecc33d8

Update README to latest version (0.4.2).

90c49b4

Add tests for Spark 1.4 / 1.5 in Travis

8779701

Fix a few Coverity inspection results plus more IntelliJ results

1c603d1

Propagate user exceptions thrown in DoFns.

22331d1

Support was added in Spark 1.5.0 for user exception propagation, see https://issues.apache.org/jira/browse/SPARK-8625. Fixes https://github.com/cloudera/spark-dataflow/issues/69
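
To make the idea of propagating a user exception concrete, here is a minimal, self-contained sketch. It does not use Spark or the runner's code: `EngineException` is a hypothetical stand-in for whatever wrapper an execution engine throws, and `rootCause` is an illustrative helper that simply walks the standard `Throwable.getCause()` chain to recover the original failure.

```java
// Hypothetical sketch: names here are illustrative, not the runner's API.
public class UserExceptionSketch {

    /** Stand-in for an engine-side exception that wraps a user failure. */
    static class EngineException extends RuntimeException {
        EngineException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Walks the cause chain and returns the innermost throwable. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop at the end of the chain; also guard against a self-referencing cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        try {
            try {
                // Simulates user code failing inside a processing function.
                throw new IllegalStateException("bad element");
            } catch (RuntimeException userFailure) {
                // Simulates the engine wrapping the failure before reporting it.
                throw new EngineException("task failed", userFailure);
            }
        } catch (EngineException wrapped) {
            System.out.println(rootCause(wrapped).getMessage());
        }
    }
}
```

This only works when the wrapper preserves the original exception as its cause, which is the kind of support the commit message attributes to Spark 1.5.0.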

The example needs --inputFile, not --input, to designate the input file

f930380

Add support for Flattening (union) PCollections and test

3478730

Wrong package utils

Upgrade to latest SDK version 1.3.0

a9168bf

Try to clean up some build warnings, related to generics, and try to …

89a21ca

…further untangle some generics issues. Update plugins. Fix some minor code issues from inspection.

First wave of changes from feedback

1229b00

asfgit merged commit 1229b00 into apache:master Mar 10, 2016

asfgit pushed a commit that referenced this pull request Mar 10, 2016

This closes #37

b2b5f42

echauchot added a commit to echauchot/beam that referenced this pull request May 12, 2017

Add streaming unit tests

4475649

Issue apache#37

asfgit pushed a commit that referenced this pull request Aug 23, 2017

Improve queries tests

7ef49dc

Fix Runner categories in tests Add streaming unit tests and corresponding labels issue #37 Update numEvents: results are no more linked to the number of events issue #22

lukecwik referenced this pull request in lukecwik/incubator-beam Mar 27, 2018

Merge pull request #37 from bsidhom/cleanup

2ba3816

Use Read -> Impulse override utilities

tvalentyn pushed a commit to tvalentyn/beam that referenced this pull request May 15, 2018

This closes apache#37

1bbf086

robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request Apr 30, 2020

Feature/check links (apache#37)

8a909de

* Make check-links script more reliable
* Fix typos in links

asfgit merged 137 commits into apache:master from tomwhite:beam-6-import-spark-runner

8 participants