Access dashDB (or DB2) using sparkR from R notebook


You can leverage the power of Spark (specifically sparkR in a R notebook) to access, manipulate and analyze data from relational databases. SparkR allows you to work with extremely large datasets. Using SparkR, you can manipulate your data via SQL queries, or via SparkR's native commands.

In this example we will use IBM dashDB, a fully managed cloud data warehouse purpose-built for analytics, and explore how to connect to it from a R notebook and perform data access using sparkR dataframes.

Database connectivity will be established using JDBC (in another article we covered how to connect to a dashDB database using RJDBC).
If you don't already have an instance of dashDB already you can get one for free by following steps in this article.

Let's start by creating a new Jupyter notebook in Data Scientist Workbench. Set it to use R.



Create a new cell and load the RJDBC package:
library(RJDBC)
Next, initialize a SQL Context:
sqlContext <- sparkRSQL.init(sc)
Enter the other database connection credentials for your instance of dashDB (or DB2 database) and run the cell.
#Enter the values for you database connection
dsn_hostname = "<Enter Hostname>" # e.g.: "bluemix05.bluforcloud.com"
dsn_port = "50000"               # e.g. "50000"
dsn_database = "BLUDB"           # e.g. "BLUDB"
dsn_uid = "<Enter UserID>"       # e.g. "dash104434"
dsn_pwd = "<Enter Password>"     # e.g. "7dBZ39xN6$o0JiX!m"
Let's create a connection URL:
conurl <- paste0("jdbc:db2://",dsn_hostname,":", dsn_port,
                "/", dsn_database,
                ":user=", dsn_username,
                ";password=", dsn_password,
                ";")
Now let's read the data from a sample table into a Spark dataframe:
df <- read.df(sqlContext, source="jdbc", 
            url=conurl,
            dbtable="GOSALESDW.EMP_EXPENSE_FACT")

class(df) #Confirm that df is a Spark dataframe
printSchema(df) #Print the schema of the Spark dataframe
We can now use either SQL, or native Spark dataframe functions to query and manipulate the data. Let's try SQL first:
registerTempTable(df, "tempdf")

results <- sql(sqlContext, "SELECT * FROM tempdf Limit 10")

# results is now a DataFrame
head(results)
And now using Spark dataframe functions:
SparkR::head(df)
To learn how to manipulate data using Spark dataframes there is a tutorial notebook on the Welcome page of the Jupyter notebooks in Data Scientist Workbench.



To learn more about Spark you can take the free Spark Fundamentals course in Big Data University.

 

Feedback and Knowledge Base