Big data is a topic that everyone seems to be talking about, yet many of us still wonder: what exactly is big data, and which technology provider should I use? I have written a couple of blog posts on Apache Hadoop, the Cloudera distribution, and the AWS EMR service. In this post, I'll go through Oracle Big Data Cloud Service and what is included in the service.
What is Oracle Big Data Cloud Service?
Oracle Big Data Cloud Service is an automated cloud service for big data processing. It is optimized to run a range of workloads, from Hadoop-only workloads (ETL, Spark, Hive) to interactive SQL queries using SQL-on-Hadoop tools. Here are some key features of Oracle Big Data Cloud Service:
- Create a Cloudera-certified cluster quickly.
- Clusters are always set up to be fault tolerant, with highly available (HA) Hadoop and security infrastructure.
- Fully tested Hadoop upgrades (version skipping supported).
- Maximum versatility: with Cloudera Distribution Including Apache Hadoop (Enterprise Data Hub edition), you can use Hadoop, Hive, Impala, Spark, and more. You can also install and operate third-party tools.
Oracle provides several big data deployment models:
- Big Data Appliance X6: Oracle provides the hardware and software for all your big data needs, so you have one vendor for storing, computing, and processing big data.
- Big Data Cloud for Customers (BDCC): The cloud service is hosted on-premises at the customer's site. You get the cloud services, but your data stays local to you. The host is still provided by Oracle, so Oracle takes care of the hardware and software implementation. You keep all the capabilities of the cloud services for processing and storing your big data, but closer to what you are familiar with: your own infrastructure. That means using your backup methods, your backup infrastructure, and your staff to make sure the physical machine on your premises is up and running.
- Big Data Cloud Service (BDCS): The service is hosted in an Oracle data center. The only thing you need is a subscription to Oracle Cloud. With that, you can store all your data and process it with all the Hadoop interfaces.
Oracle BDCS architecture overview:
An Oracle Big Data Cloud Service cluster is a collection of nodes. There are three types of nodes:
- Permanent nodes
  - As the name suggests, these are the permanent nodes of the cluster.
  - They are the master or data nodes of the Hadoop cluster, or any nodes holding Hadoop roles.
  - Each node has 32 OCPUs, 256 GB RAM, and 48 TB of storage, along with the Cloudera distribution for Hadoop.
- Edge nodes
  - These are also permanent nodes, but they hold no data and have no DataNode role.
  - An edge node contains the Hadoop client configuration and provides the interface between the Hadoop cluster and the outside network.
  - Edge nodes are commonly used to run client applications and cluster administration tools.
- Cluster compute nodes
  - Cluster compute nodes have only OCPUs and memory (no storage).
  - You can add or remove them as needed without impacting the cluster.
  - A cluster can be extended with up to 15 cluster compute nodes.
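To get a feel for what those per-node figures add up to, here is a minimal Python sketch that multiplies them out for a given number of permanent nodes. The function and class names are my own, for illustration only, and the result is raw capacity: what HDFS can actually use will be lower once replication and system overhead are taken into account.

```python
from dataclasses import dataclass

# Per-node figures quoted above for a permanent node.
OCPUS_PER_NODE = 32
RAM_GB_PER_NODE = 256
STORAGE_TB_PER_NODE = 48


@dataclass(frozen=True)
class ClusterCapacity:
    ocpus: int
    ram_gb: int
    raw_storage_tb: int


def permanent_node_capacity(node_count: int) -> ClusterCapacity:
    """Return the aggregate raw capacity of `node_count` permanent nodes."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one permanent node")
    return ClusterCapacity(
        ocpus=node_count * OCPUS_PER_NODE,
        ram_gb=node_count * RAM_GB_PER_NODE,
        raw_storage_tb=node_count * STORAGE_TB_PER_NODE,
    )


if __name__ == "__main__":
    # Three permanent nodes, for example.
    print(permanent_node_capacity(3))
```

For three permanent nodes this works out to 96 OCPUs, 768 GB of RAM, and 144 TB of raw storage.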
The software included in Oracle BDCS is:
- Oracle Linux 6 with Unbreakable Enterprise Kernel
- Oracle Java – JDK 8
- Cloudera Enterprise (Data Hub Edition) 5.x
- Cloudera Distribution Including Apache Hadoop (CDH) with support for YARN and MR2
- Cloudera Impala
- Cloudera Search
- Apache Spark
- Oracle R Distribution
- Oracle Big Data Spatial and Graph
How to create an Oracle BDCS instance:
Log in to your Oracle Cloud account and click Create Instance on the dashboard. This lists all the services you can create in your Oracle Cloud account for the region. Click Create for the Big Data service.
The Create New Oracle Big Data Cloud Service Instance wizard is displayed. Fill in the following information:
- Instance Name: bdcstest01
- Region: select the region
- Plan: Oracle Big Data Cloud Service
- Starter Pack: 1 (which allows three nodes)
- Admin details: email, username, etc.
- Additional Nodes: the starter pack always creates three nodes. To avoid extra cost, don't create additional nodes now; they can be added later.
- Click Create
A box pops up and asks for confirmation: I've selected to create an instance of Big Data Cloud Service, do I wish to continue? I click Create. The admin is notified by email once the instance has been created.
Now the status of the instance is active. We are ready to proceed to the next step and create the cluster.
Create the cluster:
Go to the Service Instance section of the Service Details: Oracle Big Data Cloud Service page and click the Open Service Console link, as highlighted in the screenshot above.
- Click Create Instance to create the cluster.
- Fill in the first tab with the instance information.
  - Instance Name: bdcstest
  - Email: email@example.com
  - Tags: test01 (or create tags at department or cost center level, etc.)
  - Description: Test cluster.
  - Click Next
- On the second page, fill in the detailed information.
  - Big Data Appliance System: bdcstest01 (the instance we created in the previous section)
  - SSH Public Key: three options are available. I copied an existing public key. You can also create a new one, then download and save it on your PC.
  - Cloudera admin password: ****
  - Secure Setup: enabled by default (recommended)
  - Oracle Storage Cloud Service: skip for now, as we will not use Oracle Storage Cloud Service for our cluster.
  - Click Create
- Review the configuration and click Create.
- Refresh to see the status. It takes some time to create the cluster.
- Now the cluster is ready, and three nodes have been provisioned as part of the cluster setup. All of them are permanent nodes.
Connect to cluster:
You can connect to the cluster using PuTTY on Windows, but you need the SSH private key. If you created a new key pair during Oracle BDCS instance creation, the first thing to do is convert the key to a format that PuTTY can use. Have a look at this article.
Now find the lowest IP address among the cluster nodes and connect to it using PuTTY. I choose the lowest IP address because it is most likely the primary node. How to connect using a private key is explained in my old post here.
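If you have more than a handful of nodes, the lowest address is worth picking numerically rather than by eye or by sorting the text, because a plain string sort puts 192.0.2.110 before 192.0.2.9. Here is a small Python sketch using the standard `ipaddress` module. The addresses are made-up examples from a documentation range, not real cluster IPs, and `lowest_ip` is just a helper name I chose.

```python
import ipaddress


def lowest_ip(addresses):
    """Return the numerically lowest address from a list of IP strings."""
    if not addresses:
        raise ValueError("no addresses given")
    # ip_address objects compare by numeric value, not by text.
    return str(min(ipaddress.ip_address(a) for a in addresses))


if __name__ == "__main__":
    # Hypothetical node addresses, for illustration only.
    nodes = ["192.0.2.110", "192.0.2.9", "192.0.2.25"]
    print(lowest_ip(nodes))
```

Keep in mind that the lowest address being the primary node is a likelihood, not a guarantee, so it is still worth confirming on the node itself after you connect.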
I connected successfully using PuTTY. Now check the cluster setup using bdacli. As I mentioned, the node with the lowest IP address is the primary node, and the same is reflected below.
Next, I would like to connect to Cloudera Manager, so open the firewall from the primary node. For simplicity I opened it to all IP addresses, but for your organization, please review this and open only specific IP addresses.
    ### Access to Cloudera Manager
    bash-4.1# bdacli bdcs_whitelist allow cloudera_manager 0.0.0.0/0
    BDCS Network Services Firewall & Whitelist Changes saved.
    ### Access to Hue
    bash-4.1# bdacli bdcs_whitelist allow hue 0.0.0.0/0
    BDCS Network Services Firewall & Whitelist Changes saved.
    bash-4.1#
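For a real environment you would replace 0.0.0.0/0 with your own address ranges. Here is a rough Python sketch that validates a list of CIDR ranges and prints one whitelist command per range, ready to paste on the primary node. The command syntax and the two service names are taken from the session above. Everything else is my assumption: the helper name, the example ranges (documentation addresses), and the idea that bdacli accepts any valid CIDR in that position, which I have not verified.

```python
import ipaddress

# Service names taken from the session above.
KNOWN_SERVICES = ("cloudera_manager", "hue")


def whitelist_commands(service, cidrs):
    """Build one `bdacli bdcs_whitelist allow` command per validated CIDR."""
    if service not in KNOWN_SERVICES:
        raise ValueError(f"unknown service: {service}")
    commands = []
    for cidr in cidrs:
        # strict=True rejects ranges with host bits set, e.g. 203.0.113.5/24.
        network = ipaddress.ip_network(cidr, strict=True)
        if network.prefixlen == 0:
            raise ValueError("refusing to open the service to every address")
        commands.append(f"bdacli bdcs_whitelist allow {service} {network}")
    return commands


if __name__ == "__main__":
    # Hypothetical office network and admin workstation.
    for cmd in whitelist_commands("cloudera_manager",
                                  ["203.0.113.0/24", "198.51.100.7/32"]):
        print(cmd)
```

The script only prints the commands. I prefer to review them and run them by hand rather than let a script change firewall rules on its own.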
Open Cloudera Manager:
You can access Cloudera Manager from the Oracle Big Data Cloud Service console or directly from a browser.
The login page appears. Enter the username admin and the password you specified during cluster creation.
Adding Nodes to a cluster:
We can extend a cluster by adding permanent Hadoop nodes, edge nodes, and cluster compute nodes. The recommendation is to add nodes in one-node increments, up to 60 nodes in the cluster. You can add permanent nodes to a cluster after it is created and started.
Go to the service instance we created for big data, click its menu to modify the instance, and add nodes.
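Before requesting more nodes, it can help to check the plan against the limits mentioned in this post: 60 nodes per cluster and 15 cluster compute nodes. Here is a minimal Python sketch of that check. Note the assumption that every node type counts toward the 60-node ceiling. That is my reading, not something I have confirmed, and the function name and parameters are only for illustration.

```python
# Limits quoted in this post; check Oracle's documentation for your release.
MAX_CLUSTER_NODES = 60
MAX_COMPUTE_NODES = 15


def check_expansion(total_nodes, compute_nodes, add_other=0, add_compute=0):
    """Return a list of problems with a planned expansion (empty if none).

    Assumption: every node type counts toward the 60-node ceiling.
    `add_other` covers permanent Hadoop nodes and edge nodes.
    """
    if min(total_nodes, compute_nodes, add_other, add_compute) < 0:
        raise ValueError("node counts cannot be negative")
    problems = []
    new_total = total_nodes + add_other + add_compute
    new_compute = compute_nodes + add_compute
    if new_total > MAX_CLUSTER_NODES:
        problems.append(f"total nodes {new_total} exceeds {MAX_CLUSTER_NODES}")
    if new_compute > MAX_COMPUTE_NODES:
        problems.append(f"compute nodes {new_compute} exceeds {MAX_COMPUTE_NODES}")
    return problems


if __name__ == "__main__":
    # Starter pack of three permanent nodes, adding one more permanent node.
    print(check_expansion(total_nodes=3, compute_nodes=0, add_other=1))
```

An empty list means the plan stays within both limits.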