Oracle Big Data Cloud Service – Introduction

Big data is a topic that everyone seems to be talking about, but many of us still wonder: what exactly is big data, and which technology provider should I use? I have written a couple of blogs on Apache Hadoop, the Cloudera distribution, and the AWS EMR service. In this blog, I'll go through Oracle Big Data Cloud Service and what is included in the service.

What is Oracle Big Data Cloud Service?

Oracle Big Data Cloud Service is an automated cloud service for big data processing. It is optimized to run different sets of workloads, from Hadoop-only workloads (ETL, Spark, Hive) to interactive SQL queries using SQL-on-Hadoop tools. Here are some key features of Oracle Big Data Cloud Service:

  • Create a Cloudera-certified cluster in a short time.
  • Cluster setup is always fault tolerant, with HA Hadoop and security infrastructure.
  • Fully tested Hadoop upgrades (version skipping supported).
  • Maximum versatility: with Cloudera's distribution including Apache Hadoop (enterprise data hub), you can use Hadoop, Hive, Impala, Spark, and more. You can also install and operate third-party tools.

Continue reading → Oracle Big Data Cloud Service – Introduction

HDFS Command line – Manage files and directories.

In my previous blogs, we configured Hadoop in single-node and cluster setups. Now let's create files and directories on the Hadoop Distributed File System (HDFS). You can see the full list here.
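
As a quick, minimal sketch (the /user/demo path and sample.txt file are just placeholder names), creating and inspecting files and directories looks like this:

  # create a directory on HDFS and confirm it exists
  hdfs dfs -mkdir -p /user/demo/input
  hdfs dfs -ls /user/demo

  # copy a local file into HDFS, then read it back
  hdfs dfs -put sample.txt /user/demo/input/
  hdfs dfs -cat /user/demo/input/sample.txt

  # remove the directory and its contents when done
  hdfs dfs -rm -r /user/demo/input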

When I started using HDFS commands, I got confused by three different command syntaxes. All three commands appear to be the same but have some differences, as explained below.

  • hadoop fs {args}

fs relates to a generic file system that can point to any file system, such as the local file system or HDFS. So it can be used when you are dealing with different file systems such as the local FS, (S)FTP, S3, and others.

  • hadoop dfs {args}

dfs is very specific to HDFS; it only works for operations related to HDFS. It has been deprecated, and we should use hdfs dfs instead.

  • hdfs dfs {args}

Same as the second, i.e. it works for all operations related to HDFS, and it is the recommended command instead of hadoop dfs.

Continue reading → HDFS Command line – Manage files and directories.

Oracle 18c – PDB Snapshot Carousel

A PDB snapshot is a named copy of a PDB at a specific point in time. When a PDB is enabled for PDB snapshots, you can create up to eight snapshots of it. The set of snapshots is called a snapshot carousel. PDB Snapshot Carousel is a new feature of Oracle Database 18c. When the maximum limit of eight snapshots per PDB is reached, a new snapshot overwrites the oldest copy.

The PDB snapshot carousel keeps this rolling set of copies so that they can be used in cases such as the following:

  • Generating non-production environments.
  • Recovering a production PDB to a point before a problem occurred.

The snapshots include a copy of the data files of the original PDB, excluding the archived redo logs. This point-in-time copy is stored on disk, by default in the same directory as the data files.

Snapshot Configuration of a PDB:

The MAX_PDB_SNAPSHOTS property specifies the maximum number of snapshots permitted in the carousel. The current setting is visible in the CDB_PROPERTIES view.
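
As a minimal sketch in SQL*Plus (assuming SYSDBA access and a PDB named PDB1, which are placeholders; the statements follow the documented 18c syntax, but verify them against your release), you can check the limit, adjust it, and take a manual snapshot:

  sqlplus / as sysdba

  -- check the current per-PDB limit from the CDB root
  SELECT con_id, property_value
  FROM   cdb_properties
  WHERE  property_name = 'MAX_PDB_SNAPSHOTS';

  -- switch into the PDB, adjust the limit, and take a manual snapshot
  ALTER SESSION SET CONTAINER = pdb1;
  ALTER PLUGGABLE DATABASE SET MAX_PDB_SNAPSHOTS = 8;
  ALTER PLUGGABLE DATABASE SNAPSHOT before_upgrade;

  -- list the snapshots currently in the carousel
  SELECT snapshot_name, snapshot_scn FROM dba_pdb_snapshots;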

You can change the maximum PDB snapshot value, and setting the value to zero will drop all existing snapshots.

Continue reading → Oracle 18c – PDB Snapshot Carousel

Analyze Big data with EMR

Amazon Elastic MapReduce (EMR) is a fully managed cluster platform that processes and analyzes large amounts of data. When you work with a large amount of data, you eventually run into processing problems. Using a Hadoop cluster, EMR helps reduce large processing problems by splitting big data sets into smaller jobs and distributing them across many compute nodes. EMR does this with big data frameworks and open-source projects, which include:

  • Apache Hadoop, Spark, HBase
  • Presto
  • Zeppelin, Ganglia, Pig, Hive, etc.

Amazon EMR is mainly used for log processing and analysis, ETL processing, clickstream analysis, and machine learning.

EMR Architecture:

The Amazon EMR architecture contains the following three types of nodes (a CLI sketch showing how they map to instance groups follows the list):

  • Master Nodes:
    • EMR has a single master node; there is no second master node to fail over to.
    • The master node manages the resources of the cluster.
    • It coordinates the distribution and parallel execution of MapReduce executables.
    • It tracks and directs HDFS.
    • It monitors the health of the core and task nodes.
    • The ResourceManager also runs on the master node and is responsible for scheduling resources.
  • Core nodes:
    • Core nodes are slave nodes and run the tasks as directed by the master node.
    • Core nodes hold data as part of HDFS or EMRFS, so the data daemons run on core nodes and store the data.
    • Core nodes also run the NodeManager, which takes instructions from the ResourceManager on how to manage resources.
    • The ApplicationMaster is a task that negotiates resources with the ResourceManager and works with the NodeManagers to execute and monitor application containers.
  • Task Nodes:
    • Task nodes are also controlled by the master node and are optional.
    • These nodes provide extra capacity to the cluster in terms of CPU and memory.
    • They can be added to or removed from a running cluster at any time.
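
To make the node roles concrete, here is a rough AWS CLI sketch that launches a cluster with one master, two core, and two task instances (the cluster name, key pair, region, instance types, and release label are placeholders; adjust them for your account):

  aws emr create-cluster \
    --name "demo-cluster" \
    --release-label emr-6.10.0 \
    --applications Name=Hadoop Name=Spark Name=Hive \
    --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
      InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
      InstanceGroupType=TASK,InstanceCount=2,InstanceType=m5.xlarge \
    --ec2-attributes KeyName=my-key-pair \
    --use-default-roles \
    --region us-east-1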

Continue reading → Analyze Big data with EMR

Amazon Machine Learning – Introduction

Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. With the rise of big data, machine learning has become a key technique for solving problems.

Machine learning uses two types of techniques:

  • Unsupervised
    • Used for unlabeled data, where we don't know the output.
    • The learning algorithm is self-guided.
    • Clustering technique: the aim is to use exploratory data analysis to find hidden patterns or groupings in the data.
  • Supervised
    • Used for labelled data, where the desired output is known.
    • The algorithm is provided with training data to learn from.
    • Techniques available: classification and regression.

Amazon Machine Learning:

Amazon ML is a robust machine learning platform that allows developers to train predictive models. Amazon ML creates models from supervised data sets. The process of creating a model from a set of known observations is called training. When setting up a new model in Amazon ML, we first need to upload our data. The data needs to be CSV-formatted, with the first row containing the name of each data field and each following row containing the data samples. Training data sets can be huge, so they need to be uploaded from either Amazon S3 or Redshift storage.
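
As a rough sketch of that layout (the file name, bucket, columns, and values below are all made up purely for illustration), a training file and its upload to S3 could look like this:

  # reviews.csv – a made-up example: a header row of field names, then one sample per row
  #   review_text,rating,liked_restaurant
  #   "great food and service",5,1
  #   "long wait and cold food",2,0

  # copy the file to S3 so it can be used as an Amazon ML training datasource
  aws s3 cp reviews.csv s3://my-ml-demo-bucket/training/reviews.csv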

To test Amazon ML, I uploaded two datasets to S3. I used customer review data to predict whether a customer will like a restaurant or not, and a second dataset to predict house prices based on previous sales.

Continue reading → Amazon Machine Learning – Introduction