Tuesday, January 21, 2014

Oracle In-database MapReduce in 12c (big data)

There has been some interest from the field in what the in-database MapReduce option is, and why and how it differs from a Hadoop solution.
I thought I would share my thoughts on it.

In-database MapReduce is an umbrella term that covers two features:
  • "SQL MapReduce", also known as "SQL pattern matching".
  • In-database container for Hadoop, to be released in a future release.


  • "SQL MapReduce": Oracle Database 12c introduced a new feature called pattern matching, using the MATCH_RECOGNIZE clause in SQL. It is one of the latest additions proposed for the ANSI SQL standard, and Oracle has implemented it. The new SQL syntax makes it intuitive to write complex queries that are not easy to implement using 11g analytic functions alone. Use cases include fraud detection, gene sequencing, time series calculations, and stock ticker pattern matching. In my experience, most Hadoop use cases on structured data can be handled in the database using MATCH_RECOGNIZE. Since this is just a SQL enhancement, it is available in both the Enterprise and Standard Edition databases.
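To give a feel for the kind of question pattern matching answers, here is a minimal Python sketch (not Oracle code) that finds "V-shapes" in a stock ticker: a run of falling prices followed by a run of rising prices. In MATCH_RECOGNIZE terms this corresponds roughly to a pattern such as `PATTERN (STRT DOWN+ UP+)` evaluated per symbol in date order. The function name, the row layout, and the choice to resume scanning from the last row of a match are assumptions made for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def find_v_shapes(rows):
    """Find falling-then-rising price runs per symbol.

    rows: iterable of (symbol, day, price) tuples (hypothetical layout).
    Returns a list of (symbol, start_day, bottom_day, end_day).
    """
    matches = []
    # Equivalent in spirit to PARTITION BY symbol ORDER BY day.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for symbol, group in groupby(ordered, key=itemgetter(0)):
        ticks = list(group)
        i = 0
        while i < len(ticks) - 2:
            j = i
            # DOWN+: one or more strictly falling prices.
            while j + 1 < len(ticks) and ticks[j + 1][2] < ticks[j][2]:
                j += 1
            bottom = j
            # UP+: one or more strictly rising prices.
            while j + 1 < len(ticks) and ticks[j + 1][2] > ticks[j][2]:
                j += 1
            if bottom > i and j > bottom:
                matches.append((symbol, ticks[i][1], ticks[bottom][1], ticks[j][1]))
                # Resume at the last row of the match so that
                # back-to-back V-shapes are both reported.
                i = j
            else:
                i += 1
    return matches


if __name__ == "__main__":
    prices = [10, 8, 6, 7, 9, 8, 10]
    rows = [("ACME", day, price) for day, price in enumerate(prices, start=1)]
    for match in find_v_shapes(rows):
        print(match)
```

Writing this by hand needs explicit loops and index bookkeeping; the appeal of the SQL clause is that the same intent is stated declaratively as a pattern over named row conditions.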


  • "In-database container for Hadoop (beta)": If your development team is more skilled at Hadoop than at SQL, or you want to run complex pre-packaged Hadoop algorithms, you could use the Oracle container for Hadoop (beta). It is a prototype of the Hadoop APIs that runs within the Java virtual machine in the database. It implements the Hadoop Java APIs and interfaces with the database through parallel table functions, which read data in parallel. One interesting fact about parallel table functions is that they can run in parallel across a RAC cluster and can also route data to specific parallel processes. This functionality is the key to making Hadoop scale across clusters, and it has existed in the database for over 15 years. The advantages of in-database Hadoop are:
  1. There is no need to move data out of the database to run MapReduce functions, which saves time and resources.
  2. More real-time data can be used.
  3. There are fewer redundant copies of the data, which means better security and less disk space used.
  4. The servers can be used not just for MapReduce work but also to run the database, making better use of resources.
  5. The output of MapReduce is immediately available to analytic tools, and this can be combined with database features like the "in-memory option" (beta) to get near-real-time analysis of big data.
  6. Database features for security, backup, auditing, and performance can be combined with the MapReduce API.
  7. The output of one parallel table function can be streamed as input to the next parallel table function, so no intermediate stages need to be maintained.
  8. Features such as graph, text, spatial, and semantic processing within the Oracle database can be used for further analysis.
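The two ideas that matter most above, routing rows with the same key to the same parallel process and streaming one stage straight into the next, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses neither the Hadoop API nor Oracle parallel table functions, it runs sequentially in one process, and every name in it (`map_stage`, `partition`, `reduce_stage`, `run`, the customer/amount rows) is hypothetical.

```python
import zlib
from collections import defaultdict


def map_stage(rows):
    """Map: emit (key, value) pairs one at a time.

    This is a generator, so nothing is materialized between stages.
    """
    for customer, amount in rows:
        yield customer, amount


def partition(pairs, workers):
    """Route every pair with the same key to the same worker slot.

    A stable checksum is used so the routing is repeatable across runs.
    """
    buckets = [[] for _ in range(workers)]
    for key, value in pairs:
        slot = zlib.crc32(key.encode("utf-8")) % workers
        buckets[slot].append((key, value))
    return buckets


def reduce_stage(bucket):
    """Reduce: sum the values for each key seen by one worker."""
    totals = defaultdict(int)
    for key, value in bucket:
        totals[key] += value
    yield from sorted(totals.items())


def run(rows, workers=2):
    """Chain map -> partition -> reduce and merge the per-worker results."""
    buckets = partition(map_stage(rows), workers)
    result = {}
    # Each bucket stands in for one parallel process; here they run in turn.
    for bucket in buckets:
        result.update(reduce_stage(bucket))
    return result


if __name__ == "__main__":
    orders = [("ann", 5), ("bob", 3), ("ann", 7), ("cy", 1)]
    print(run(orders))
```

Because all rows for a key land in one bucket, each worker can reduce its share independently, which is what lets this style of processing scale across nodes.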
In addition, Oracle 12c will support schemaless access using JSON. That will allow NoSQL-style big data use cases to run on data within the Oracle database as well.

Conclusion
These features help solve MapReduce challenges when the data is mostly within the database: they reduce data movement and make better use of available resources.
If most of your data is outside the database, then the SQL Connector for Hadoop and Oracle Loader for Hadoop could be used.


Reference

1) Presentation by Kuassi Mensah
