How to work with Hadoop data using SQL in a Scala Jupyter notebook

IBM Big SQL provides standards-compliant SQL access to data in Hadoop. Developers familiar with SQL can access data in Hadoop without having to learn new languages or skills.

You can work with Hadoop data using SQL in a Jupyter notebook. The required libraries are pre-installed in your Jupyter notebook environment, so you can establish a JDBC connection to a remote Hadoop cluster running Big SQL and then run SQL queries over Hadoop data.

Credentials

Big SQL Technology Sandbox is a large, shared Hadoop cluster. You can use it to run R, SQL, Spark, and Hadoop jobs. It is a high-performance environment that demonstrates the advantages of parallelized processing of big data sets.

For credentials, sign up for an account on Demo Cloud. Your username and password there will be used for new SQL connections.

Access Hadoop data using SQL

Create a new Jupyter notebook in Data Scientist Workbench. Set it to use Scala.



Enter your Big SQL Technology Sandbox username and password in a new cell.
val username = "my_demo_cloud_username"
val password = "my_demo_cloud_password"
Note: Your username is different from your email address. For example, the username for jane.doe@example.com might be janedoe. You can see your username in the top right corner of Demo Cloud when you're logged in.

Run the cell above.

Create a new cell.

Enter the other connection details for our cluster.
val database = "bigsql"
val hostname = "iop-bi-master.imdemocloud.com"
val port = "32051"
Run the cell above. Create a new cell.

Create the connection using the code below.
import java.sql.{Connection, DriverManager, ResultSet}
DriverManager.registerDriver(new com.ibm.db2.jcc.DB2Driver())

val url = s"jdbc:db2://$hostname:$port/$database"

val connection = DriverManager.getConnection(url, username, password)

Run the cell above. If something didn't work, make sure you set the username and password variables in the first cell to your Demo Cloud username and password.
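If the connection fails, it can also help to check that the JDBC URL was assembled correctly. A minimal sanity check, using the same values as above:

```scala
// Sanity check (sketch): with the values used above, the assembled URL
// should follow DB2's standard jdbc:db2://host:port/database form.
val database = "bigsql"
val hostname = "iop-bi-master.imdemocloud.com"
val port = "32051"
val url = s"jdbc:db2://$hostname:$port/$database"
println(url)  // jdbc:db2://iop-bi-master.imdemocloud.com:32051/bigsql
```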

gosalesdw.emp_employee_dim is a sample table in the bigsql database.

Create a new cell. Run a SQL query against the sample data.
val sql = "select * from gosalesdw.emp_employee_dim"
val results = connection.createStatement.executeQuery(sql)
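If you want to filter on a value, a PreparedStatement avoids string concatenation and quoting problems. This is a sketch: the key 4001 is a hypothetical value, and EMPLOYEE_KEY is assumed to be an integer column.

```scala
// Sketch: parameterized query with java.sql.PreparedStatement.
// 4001 is a hypothetical employee key; adjust to your data.
val stmt = connection.prepareStatement(
  "select EMPLOYEE_KEY, EMPLOYEE_NAME from gosalesdw.emp_employee_dim " +
  "where EMPLOYEE_KEY = ?")
stmt.setInt(1, 4001)
val filtered = stmt.executeQuery()
```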
Create a new cell. Let's print the results.
while (results.next()) {
    val name = results.getString("EMPLOYEE_NAME")
    val key = results.getString("EMPLOYEE_KEY")
    println("Employee key, name = " + key + ", " + name)
}
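If you'd rather work with the rows as a Scala collection than print them, you can drain the ResultSet into a List. A sketch (re-run the query first, since the loop above has already consumed the results; a ResultSet can only be iterated forward once):

```scala
// Sketch: drain a java.sql.ResultSet into an immutable List of tuples.
// Assumes `results` is a freshly executed ResultSet.
val rows = Iterator
  .continually(results)
  .takeWhile(_.next())
  .map(r => (r.getString("EMPLOYEE_KEY"), r.getString("EMPLOYEE_NAME")))
  .toList
println(rows.size + " rows fetched")
```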
Cool! You've accessed data in a Hadoop cluster using a SQL connection from a Jupyter notebook.

Finally, as a best practice, close the database connection when you're done with it.
connection.close()
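In a notebook an explicit close() is fine, but in a standalone script it's safer to wrap the work in try/finally so the connection is released even if a query throws. A minimal sketch, reusing the url, username, and password values from the cells above:

```scala
// Sketch: guarantee cleanup with try/finally.
val conn = java.sql.DriverManager.getConnection(url, username, password)
try {
  val rs = conn.createStatement.executeQuery(
    "select count(*) from gosalesdw.emp_employee_dim")
  if (rs.next()) println("row count: " + rs.getInt(1))
} finally {
  conn.close() // runs whether or not the query succeeded
}
```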
