Submitting Jobs to the Spark Service on Bluemix

Apache Spark is an open-source cluster computing framework optimized for fast, large-scale data processing. Data Scientist Workbench provides a local Spark environment in Jupyter notebooks. This local Spark environment is intended for learning and testing with small data sets.

Users who intend to use Spark for production or more intensive workloads, for example those involving very large data sets, are encouraged to use the managed Spark service on IBM Bluemix. This Spark service allows for fast in-memory analytics on large data sets and is available as a pay-as-you-go service on Bluemix. New users can take advantage of a time-limited free trial.

Jupyter notebooks within Data Scientist Workbench allow you to execute Spark jobs using spark-submit on remote clusters such as the Spark service on Bluemix.

To submit your notebooks for execution on the Spark service on Bluemix, you will first need a running instance of the service and credentials to access it. Instructions for this can be found in this knowledgebase article.

The data set to be analyzed must be accessible to the Spark service on Bluemix. This is typically done by placing the data in Bluemix Object Storage, an instance of which is automatically provisioned when you create an instance of the Spark service.

To get started, open the notebook you have prepared for execution on the remote Spark cluster. For illustration, this article uses the Improved Flight Delay Prediction tutorial included on the Welcome page of Jupyter notebooks in DSWB:



Next, towards the top-right corner you will see the Recent Notebooks tab, which lists all recently opened notebooks. Expand the twistie for the notebook you are interested in and click on the Submit to Spark link:



This will open a new notebook called Submit-to-Spark-ClusterN.ipynb. Scroll down to the cell that contains the credentials and the code for submitting the notebook to the Spark service on Bluemix. Replace the credentials with those for your instance of the Spark service on Bluemix:



Further down in the same cell is the call that submits the notebook using spark-submit. Insert the path of the notebook to be submitted to the Spark service.
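As the cell output later in this article shows, the notebook is converted to a .py file before it is uploaded. To preview roughly what that script will contain, the sketch below extracts the code cells from an .ipynb file using only the Python standard library. It is a simplified stand-in for the nbconvert step, not the code that the Submit-to-Spark notebook runs: the function name notebook_to_python is hypothetical, and commenting out lines that start with % or ! is an assumption made here so that the result is plain Python.

```python
import json


def notebook_to_python(ipynb_path):
    """Return the code cells of a notebook joined into one Python script.

    Hypothetical helper: a simplified stand-in for nbconvert that keeps only
    code cells and comments out IPython magics and shell escapes.
    """
    with open(ipynb_path, encoding="utf-8") as f:
        notebook = json.load(f)

    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as a string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        lines = []
        for line in source.splitlines():
            # Lines starting with % or ! only work inside IPython
            if line.lstrip().startswith(("%", "!")):
                line = "# " + line
            lines.append(line)
        chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"
```

For example, print(notebook_to_python("/resources/FlightDelay_demo2_bluemix.ipynb")) would show the code that is about to be submitted, which makes it easier to spot notebook-only constructs before the job runs remotely.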



The execution time varies with the size of the data set to be analyzed, the complexity of your model, and the amount of resources available to your instance of the Spark service. The tutorial notebook in this example can take 5-15 minutes on a default-size instance. You can see progress in the output of the cell:


Converting notebook to .py file ...
[NbConvertApp] Converting notebook /resources/FlightDelay_demo2_bluemix.ipynb to python
[NbConvertApp] Writing 11973 bytes to /tmp/FlightDelay_demo2_bluemix-2016_08_11_19_14_32.py
Converting notebook to .py file ... Done !
Preparing Python code for submission to Bluemix Spark
Bluemix credentials are present
Tools for submitting to Bluemix Spark are present
Submit job to Bluemix Spark Service ...
To see the log, in another terminal window run the following command:
tail -f spark-submit_1470942874134641685.log
Uploading /tmp/FlightDelay_demo2_bluemix-2016_08_11_19_14_32.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12263    0   137  100 12126     20   1837  0:00:06  0:00:06 --:--:--  2967
Submitting Job

...

Submission complete. Log file: spark-submit_1470942874134641685.log
Job execution complete
To see the output, you can run !cat on the new stdout file

Once the notebook has finished executing on the remote Spark cluster, you can view the results by printing the contents of the stdout file, which can be found under Recent Data on the right side.
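Besides running !cat on the file, the newest output file can be located and printed from a notebook cell with plain Python. The following is a minimal sketch under stated assumptions: the helper names latest_file and read_text are hypothetical, and both the search directory and the "*stdout*" pattern are guesses, since this article does not give the exact name of the stdout file. Adjust them to match the file that appears under Recent Data.

```python
import glob
import os


def latest_file(directory, pattern):
    """Return the most recently modified file in `directory` that matches
    `pattern`, or None if nothing matches. Hypothetical helper."""
    matches = glob.glob(os.path.join(directory, pattern))
    files = [m for m in matches if os.path.isfile(m)]
    return max(files, key=os.path.getmtime) if files else None


def read_text(path):
    """Read a text file, replacing any bytes that are not valid UTF-8."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


if __name__ == "__main__":
    # Assumption: the stdout file is in the current directory and its name
    # contains "stdout"; change both values to match your environment.
    path = latest_file(".", "*stdout*")
    print(read_text(path) if path else "No stdout file found yet")
```

Selecting by modification time rather than by name means the sketch picks up the output of the most recent submission even when several earlier runs have left files behind.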



Good luck!
