Execute R jobs on a Hadoop cluster from R Studio IDE

With its rich set of libraries and analytics-focused design, R has proven popular with data scientists. Unfortunately, the size of the data set you can work on in R is limited by the memory and CPU of the machine executing the R code. R Studio IDE in the Data Scientist Workbench can help you overcome these limitations by submitting your R program for execution on a large-scale Hadoop cluster. The cluster runs your job in parallel across its machines, taking advantage of their combined processing power and memory to work on very large data sets.

R Studio IDE does so by using Big R, a library of functions that provides end-to-end integration between the R language and IBM Open Platform for Hadoop. Big R can be used for comprehensive data analysis on the Hadoop cluster, hiding some of the complexity of manually writing MapReduce jobs.

The R Studio IDE in Data Scientist Workbench comes with the Big R libraries pre-installed, allowing you to send your R program for execution on a Hadoop cluster directly from the IDE.

Credentials

Big SQL Technology Sandbox is a large, shared cluster powered by Hadoop. You can use it to run R, SQL, Spark, and Hadoop jobs. It is a high-performance environment that demonstrates the advantages of parallelized processing of big data sets.

For credentials, sign up for an account on Demo Cloud. You will use your Demo Cloud username and password to connect to the cluster.

Run R jobs

Open R Studio IDE.

Enter your Demo Cloud username and password in the R Studio console or a new script.
user = "my_demo_cloud_username"
password = "my_demo_cloud_password"
Note: Your Big SQL Technology Sandbox username is different from your email address. For example, the username for jane.doe@example.com might be janedoe. You can see your username in the top right corner of Demo Cloud when you're logged in.
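
Hardcoding a password in a script makes it easy to leak by accident, for example when the script is shared. As a minimal sketch, assuming you have set the credentials as environment variables beforehand, you could read them at run time instead. The helper name read_required_env and the variable names DEMO_CLOUD_USER and DEMO_CLOUD_PASSWORD are hypothetical; neither Demo Cloud nor Big R defines them.

```r
# Read a required setting from the environment, failing loudly if it is unset.
# `read_required_env` is a hypothetical helper, not part of Big R.
read_required_env <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# Possible use (uncomment once the variables are set, e.g. in ~/.Renviron).
# The variable names are assumptions; choose your own.
# user     <- read_required_env("DEMO_CLOUD_USER")
# password <- read_required_env("DEMO_CLOUD_PASSWORD")
```

Failing early with a clear message is usually easier to debug than a rejected login later in the session.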

Enter the other connection details for our cluster.
host = "iop-bi-master.imdemocloud.com"
Connect to the cluster.
library(bigr)
bigr.connect(host=host, user=user, password=password)
Verify the connection.
is.bigr.connected()
You should see TRUE if you're connected.

Cool, you're now connected!

Consult the Big R API reference to see examples of R jobs you can run.

Finally, as a best practice, close the Big R connection once you're done with it.
bigr.disconnect()
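
If a job fails partway through, the disconnect call above may never run. One way to guard against that is a small helper that registers the cleanup with on.exit(), sketched here under the assumption that bigr.connect() and bigr.disconnect() behave as used earlier in this tutorial. The helper name with_connection and its arguments are hypothetical and not part of Big R. The connect and disconnect steps are passed in as functions so the pattern can be tried without a cluster.

```r
# Run `job` between `connect` and `disconnect`, guaranteeing that
# `disconnect` is called even when `job` raises an error.
# `with_connection` is a hypothetical helper, not part of Big R.
with_connection <- function(connect, disconnect, job) {
  connect()
  # Registered only after a successful connect, so a failed
  # connection attempt does not trigger a disconnect.
  on.exit(disconnect(), add = TRUE)
  job()
}

# Possible use with Big R (requires the bigr package and a reachable cluster):
# with_connection(
#   connect    = function() bigr.connect(host = host, user = user, password = password),
#   disconnect = bigr.disconnect,
#   job        = function() is.bigr.connected()
# )
```

The helper returns whatever the job function returns, so results can be captured as usual.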
