Category Archives: Java

SIX sparkling features of Apache Spark!

What is Apache Spark? Why is there such a buzz around it? If you are in the Big Data analytics business, should you really care about Spark? Hopefully this post will answer some of the questions that might have been coming to your mind.

Apache Spark is a powerful open source processing engine for Hadoop data built around speed, ease of use, and sophisticated analytics. It was originally developed in UC Berkeley’s AMPLab and later moved to Apache. Apache Spark is essentially a parallel data processing framework that can work with Apache Hadoop, making it easy to develop fast Big Data applications that combine batch, streaming, and interactive analytics on all your data.

Let’s go through some of the features that really make it stand out in the Big Data world!

  1. Lightning Fast Processing

When it comes to Big Data processing, speed always matters. We always want to process our huge data sets as fast as possible. Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk: it keeps intermediate processing data in memory. It uses the concept of a Resilient Distributed Dataset (RDD), which lets it transparently store data in memory and persist it to disk only when it’s needed. This avoids most of the disk reads and writes (the main time-consuming factors) in data processing.

(Spark performance over Hadoop. Image courtesy: Cloudera. Visit this link to see how Jai & Matei explain the delightful experience Spark gives its developers.)
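To make the RDD idea concrete, here is a minimal sketch using Spark's Java API. It assumes Java 8 lambdas and a local master; the input path and the "ERROR"/"timeout" filters are invented for illustration.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddCacheExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load a text file into an RDD (the path is a placeholder)
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log");

        // cache() asks Spark to keep this RDD in memory after it is first computed
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();

        // Both actions below reuse the cached, in-memory copy instead of re-reading from disk
        long total = errors.count();
        long timeouts = errors.filter(line -> line.contains("timeout")).count();

        System.out.println("errors=" + total + ", timeouts=" + timeouts);
        sc.stop();
    }
}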

  2. Ease of Use as it supports multiple languages

Spark lets you quickly write applications in Java, Scala, or Python. This helps developers create and run their applications in programming languages they already know. It comes with a built-in set of over 80 high-level operators. We can also use it interactively to query data from the shell.

  3. Support for Sophisticated Analytics

In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a single workflow.

  4. Real-time Stream Processing

Spark can handle real-time streaming. MapReduce mainly processes data that has already been stored, whereas Spark can also manipulate data in real time using Spark Streaming (a minimal sketch follows the list below). That said, there are other frameworks whose integration with Hadoop also lets us handle streaming.

Here is what Cloudera says about Spark Streaming’s abilities:

  • Easy: Built on Spark’s lightweight yet powerful APIs, Spark Streaming lets you rapidly develop streaming applications
  • Fault tolerant: Unlike other streaming solutions (e.g. Storm), Spark Streaming recovers lost work and delivers exactly-once semantics out of the box with no extra code or configuration
  • Integrated: Reuse the same code for batch and stream processing, even joining streaming data to historical data

(Streaming performance over Storm. Image courtesy: Cloudera.com)
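For illustration, here is a minimal Spark Streaming sketch in Java. It assumes text lines are fed to a local TCP socket (for example with nc -lk 9999); the host, port and 5-second batch interval are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]");
        // Process the incoming stream in 5-second micro-batches
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Receive lines of text from a TCP socket
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // The familiar RDD-style operators apply to each micro-batch
        JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));
        errors.count().print();

        ssc.start();
        ssc.awaitTermination();
    }
}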

  5. Ability to integrate with Hadoop and existing Hadoop data

Spark can run independently. Apart from that, it can run on Hadoop 2’s YARN cluster manager and can read any existing Hadoop data. That’s a BIG advantage! It can read from any Hadoop data source, for example HBase, HDFS, etc. This makes Spark suitable for migrating existing pure Hadoop applications, provided the application’s use case is really a good fit for Spark. Since Spark relies heavily on immutability, not every scenario is suitable for migration.

  6. Active and Expanding Community

Apache Spark is built by a wide set of developers from over 50 companies. The project started in 2009 and as of now more than 250 developers have contributed to Spark already! It has active mailing lists and JIRA for issue tracking.

Below are some useful links to start with:

If you want to learn the basics of Apache Spark, my previous post will help you; it includes a training video link that explains Spark in a simple way.


11 OPEN Document-Oriented Databases which come under the NoSQL DB Category!

A document-oriented database is designed for storing, retrieving, and managing document-oriented, or semi-structured, data. Document-oriented databases are one of the main categories of NoSQL databases. The central concept of a document-oriented database is the notion of a Document. While each document-oriented database implementation differs on the details of this definition, in general they all assume that documents encapsulate and encode data (or information) in some standard format(s) or encoding(s). Encodings in use include XML, YAML, JSON and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on).

  • MongoDB:  MongoDB is a collection-oriented, schema-free document database. Data is grouped into sets called ‘collections’. Each collection has a unique name in the database and can contain an unlimited number of documents. Collections are analogous to tables in an RDBMS, except that they don’t have any defined schema.

It stores data in BSON format (a binary-encoded serialization of JSON-like documents): a structured collection of key-value pairs, where keys are strings and values can be any of a rich set of data types, including arrays and nested documents. A minimal usage sketch with the Java driver follows the links below.

Home: http://www.mongodb.org/
Quick Start: http://www.mongodb.org/display/DOCS/Quickstart
Download: http://www.mongodb.org/downloads
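To make this concrete, here is a small sketch using the current MongoDB Java (sync) driver, which is newer than the API that was available when this post was written; the database, collection and fields are invented for the example.

import java.util.Arrays;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class MongoSketch {
    public static void main(String[] args) {
        // Connect to a local mongod on the default port
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("blog").getCollection("posts");

            // A document is just a set of key-value pairs; no schema needs to be declared
            Document post = new Document("title", "Hello NoSQL")
                    .append("tags", Arrays.asList("nosql", "mongodb"))
                    .append("views", 42);
            posts.insertOne(post);

            // Query by field value and print the stored document as JSON
            Document found = posts.find(new Document("title", "Hello NoSQL")).first();
            System.out.println(found.toJson());
        }
    }
}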

  • CouchDB:  CouchDB is a document database server, accessible via a RESTful JSON API (a minimal sketch follows the links below). It is ad-hoc and schema-free with a flat address space. It is queryable and indexable, featuring a table-oriented reporting engine that uses JavaScript as its query language. A CouchDB document is an object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps.

Home: http://couchdb.apache.org/
Quick Start: http://couchdb.apache.org/docs/intro.html
Download: http://couchdb.apache.org/downloads.html
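Because everything in CouchDB happens over plain HTTP, a sketch with the JDK 11 HttpClient is enough to show the idea. It assumes a local CouchDB on the default port 5984 that accepts unauthenticated writes, and the database and document names are made up.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchDbSketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String base = "http://localhost:5984";   // default CouchDB port, assumed local install

        // PUT /albums -> creates a database
        HttpRequest createDb = HttpRequest.newBuilder(URI.create(base + "/albums"))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        System.out.println(http.send(createDb, HttpResponse.BodyHandlers.ofString()).body());

        // PUT /albums/kind-of-blue -> stores a JSON document under an explicit id
        String doc = "{\"artist\": \"Miles Davis\", \"year\": 1959}";
        HttpRequest putDoc = HttpRequest.newBuilder(URI.create(base + "/albums/kind-of-blue"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(doc))
                .build();
        System.out.println(http.send(putDoc, HttpResponse.BodyHandlers.ofString()).body());

        // GET /albums/kind-of-blue -> fetches the document back as JSON
        HttpRequest getDoc = HttpRequest.newBuilder(URI.create(base + "/albums/kind-of-blue")).GET().build();
        System.out.println(http.send(getDoc, HttpResponse.BodyHandlers.ofString()).body());
    }
}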

  • Terrastore: Terrastore is a modern document store which provides advanced scalability and elasticity features without sacrificing consistency. It is based on Terracotta, so it relies on an industry-proven, fast clustering technology.

Home: http://code.google.com/p/terrastore/
Quick Start: http://code.google.com/p/terrastore/wiki/Documentation
Download: http://code.google.com/p/terrastore/downloads/list

  • RavenDB: Raven is a .NET, LINQ-enabled document database, focused on providing a high-performance, schema-less, flexible and scalable NoSQL data store for the .NET and Windows platforms.
    Raven stores any JSON document inside the database. It is a schema-less database where you can define indexes using C#’s LINQ syntax.

Home: http://ravendb.net/
Quick Start: http://ravendb.net/tutorials
Download: http://ravendb.net/download

  • OrientDB: OrientDB is an open source NoSQL database management system written in Java. Although it is a document-based database, relationships are managed as in graph databases, with direct connections between records. It supports schema-less, schema-full and schema-mixed modes. It has a strong security profiling system based on users and roles, and supports SQL as a query language.

Home: http://www.orientechnologies.com/
Quick Start: http://code.google.com/p/orient/wiki/Tutorials
Download: http://code.google.com/p/orient/wiki/Download

  • ThruDB: Thrudb is a set of simple services built on top of the Apache Thrift framework that provides indexing and document storage services for building and scaling websites. Its purpose is to offer web developers flexible, fast and easy-to-use services that can enhance or replace traditional data storage and access layers.
    It supports multiple storage backends such as BerkeleyDB, Disk and MySQL, and also has Memcache and Spread integration.

Home: http://code.google.com/p/thrudb/
Quick Start: http://thrudb.googlecode.com/svn/trunk/doc/Thrudb.pdf
Download: http://code.google.com/p/thrudb/source/checkout

  • SisoDB:  SisoDb is a document-oriented DB provider for SQL Server, written in C#. It lets you store object graphs of POCOs (Plain Old CLR Objects) without having to configure any mappings. Each entity is treated as an aggregate root and gets separate tables created on the fly.

Home: http://www.sisodb.com
Quick Start: http://www.sisodb.com/Wiki
Download: https://github.com/danielwertheim/SisoDb-Provider/

  • RaptorDB: RaptorDB is an extremely small and fast embedded NoSQL persisted-dictionary database using B+tree or MurMur hash indexing. It was primarily designed to store JSON data (see the author’s fastJSON implementation), but it can store any type of data you give it.

Home: http://www.codeproject.com/KB/database/RaptorDB.aspx
Quick Start: http://www.codeproject.com/KB/database/RaptorDB.aspx
Download: http://www.codeproject.com/KB/database/RaptorDB.aspx

  • CloudKit: CloudKit provides schema-free, auto-versioned, RESTful JSON storage with optional OpenID and OAuth support, including OAuth Discovery.

Home: http://getcloudkit.com/
Quick Start: http://getcloudkit.com/api/
Download: https://github.com/jcrosby/cloudkit

  • Persevere: Persevere is an open source set of tools for persistence and distributed computing using intuitive standards-based JSON interfaces: HTTP REST, JSON-RPC, JSONPath, and REST Channels. The core of the Persevere project is the Persevere Server. The Persevere server includes a Persevere JavaScript client, but the standards-based interface is intended to be used with any framework or client.

Home: http://code.google.com/p/persevere-framework/
Quick Start: http://code.google.com/p/persevere-framework/w/list
Download: http://code.google.com/p/persevere-framework/downloads/list

  • Jackrabbit: The Apache Jackrabbit™ content repository is a fully conforming implementation of the Content Repository for Java Technology API (JCR, specified in JSR 170 and JSR 283). A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more. A minimal JCR usage sketch follows the links below.

Home: http://jackrabbit.apache.org
Quick Start: http://jackrabbit.apache.org/getting-started-with-apache-jackrabbit.html
Download: http://jackrabbit.apache.org/downloads.html
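Here is a minimal JCR sketch against an embedded Jackrabbit repository, loosely in the style of the project’s “First Hops” examples; the credentials, node name and property are placeholders.

import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.jackrabbit.core.TransientRepository;

public class JcrSketch {
    public static void main(String[] args) throws Exception {
        // An embedded, in-process repository that lives for the duration of the session
        Repository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // Content is stored as a hierarchy of nodes and properties
            Node root = session.getRootNode();
            Node hello = root.addNode("hello");
            hello.setProperty("message", "Hello, Jackrabbit");
            session.save();

            System.out.println(session.getNode("/hello").getProperty("message").getString());
        } finally {
            session.logout();
        }
    }
}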

Conclusion:
Document databases store and retrieve documents; the basic atomic storage unit is a document. As always, your requirements drive the decision. You need to think about your data-access patterns and use cases to create a smart document model. When your domain model can be split and partitioned across documents, a document database is a suitable choice. For example, for blog software, a CMS or a wiki, a document DB works extremely well. At the same time, a non-relational database is not better than a relational one in cases where your data has a lot of relations and needs normalization.

Also check the following Stack Overflow thread covering the pros and cons of document-based vs relational databases:
http://stackoverflow.com/questions/337344/pros-cons-of-document-based-databases-vs-relational-databases

Wink – A framework for RESTful web services from Apache

Apache Wink 1.0 is a complete Java-based solution for implementing and consuming REST-based web services. The goal of the Wink framework is to provide a reusable and extendable set of classes and interfaces that will serve as a foundation on which a developer can efficiently construct applications.

Taken from the official Apache Wink site.

Wink consists of a Server module for developing REST services, and of a Client module for consuming REST services. It cleanly separates the low-level protocol aspects from the application aspects. Therefore, in order to implement and consume REST Web Services the developer only needs to focus on the application business logic and not on the low-level technical details.

REST Web Service design structure

The Wink Server module is a complete implementation of the JAX-RS v1.0 specification. On top of this implementation, the Wink Server module provides a set of additional features that were designed to facilitate the development of RESTful Web services.
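Since the Wink server module implements JAX-RS, a service is just an annotated resource class. Here is a minimal, hypothetical sketch; the class name, path and JSON payload are invented for illustration.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

// Registered with the runtime through the standard javax.ws.rs.core.Application mechanism
@Path("/employees")
public class EmployeeResource {

    @GET
    @Path("{id}")
    @Produces("application/json")
    public String getEmployee(@PathParam("id") String id) {
        // A real service would look the employee up in a data store
        return "{\"id\": \"" + id + "\", \"name\": \"bond\"}";
    }
}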

The Wink Client module is a Java based framework that provides functionality for communicating with RESTful Web services. The framework is built on top of the JDK HttpURLConnection and adds essential features that facilitate the development of such client applications.
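On the client side, consuming such a service looks roughly like the sketch below using Wink’s RestClient; the URL is a placeholder and the snippet is indicative rather than a definitive reference.

import org.apache.wink.client.Resource;
import org.apache.wink.client.RestClient;

public class EmployeeClient {
    public static void main(String[] args) {
        RestClient client = new RestClient();

        // Point the client at the resource exposed by the server module
        Resource resource = client.resource("http://localhost:8080/rest/employees/7");

        // Issue a GET and read the response body as a String
        String json = resource.accept("application/json").get(String.class);
        System.out.println(json);
    }
}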

How to create a RESTful service using Wink? [Coming soon – next post 🙂 ]

Story of an Old Carpenter

“Your life today is the result of your attitudes and choices in the past. Your life tomorrow will be the result of your attitudes and the choices you make today.”

This is a story of an elderly carpenter who had been working for a contractor for the past 53 years. He had built many beautiful houses, but now that he was getting old, he wanted to retire and lead a leisurely life with his family. So he went to the contractor and told him about his plan to retire. The contractor felt sad at the prospect of losing a good worker but agreed, because the carpenter had indeed become too frail for the tough building work. As a last request, he asked the old carpenter to construct just one last house.
The old man agreed and started working, but his heart was no longer in the work. He had lost his motivation, so he resorted to shoddy workmanship and built the house half-heartedly. After the house was finished, the contractor came to visit his employee’s last piece of work. After inspecting the house, he handed the front door keys to the carpenter and said, “This is your new house. My gift to you.” The carpenter was shocked and upset. Had he known he was building his own house, he would have done a far better job! Now he would have to live in a house that was not worth living in.
Think of yourself as the carpenter. You work hard every day, but are you giving your best? We put in the least effort for the work we don’t like or have no interest in. Later, we are shocked at the situation we have created for ourselves and try to figure out why we didn’t do it differently.
Enjoy your tasks and carry out your responsibilities with pleasure, not with pain. “Life is a do-it-yourself project.” Do your job enthusiastically and with devotion, and a positive outcome and a pleasing life will certainly come your way.

Apache Geronimo Project

The goal of the Geronimo project is to produce a server runtime framework that pulls together the best Open Source alternatives to create runtimes that meet the needs of developers and system administrators. Our most popular distribution is a fully certified Java EE 5 application server runtime.

Some of our guiding principles are:

  • Easy to use.
  • Build servers that are distributed under the Apache Software License.
  • Provide runtimes that meet the needs of developers, administrators and system integrators.
  • Integrate with the best open source tooling available like Eclipse.
  • Provide frequent releases of our software so users can experience the newest features and have access to the latest bug fixes.
  • Build a community that incorporates multiple disciplines required to create complex runtime and toolable infrastructure.

Creating Mock Tests: Using EasyMock

Unit testing is now a "best practice" for software development. In unit testing we face many situations where we need to interact with a database or other external resources, but at the same time we need to keep our tests isolated. Here comes the importance of mock objects.

Mock objects are a useful way to write unit tests for objects that act as mediators. “Instead of calling the real domain objects, the tested object calls a mock domain object that merely asserts that the correct methods were called, with the expected parameters, in the correct order.”

Using EasyMock Framework

EasyMock is a framework for creating mock objects using the java.lang.reflect.Proxy object. When a mock object is created, a proxy object takes the place of the real object. The proxy object gets its definition from the interface or class you pass when creating the mock.
EasyMock provides two APIs for creating mock objects: one based on interfaces and the other on classes (org.easymock.EasyMock and org.easymock.classextension.EasyMock respectively).

We can separate EasyMock usage into FOUR steps:

1. Creating a Mock Object using “EasyMock.createMock”.

Create Mock: Using this static method we create a mock object. This is the first step in mock testing.

When we create mock objects, we can choose between three modes.

Regular: If a method we expect to be executed is not executed, the test will fail. And if any unexpected method is executed, the test also fails. Here the order of method execution is not important.

Ex: EmpDAO empDAO = EasyMock.createMock(EmpDAO.class);

Nice: If a method we expect to be executed is not executed, the test will fail. But if an unexpected method is executed, it simply returns a default value. Here also the order is not important.

Ex: EmpDAO empDAO = EasyMock.createNiceMock(EmpDAO.class);

Strict: Same as regular, but here the order of the expected method calls is also important.

Ex: EmpDAO empDAO = EasyMock.createStrictMock(EmpDAO.class);

2. Expecting mock object method calls using “EasyMock.expect”.

This is used to record expected method calls on our mock object. Let’s go through an example.

Let’s assume we have the following methods, which get employee information from the database.

List<Employee> employee = empDao.getEmpDetails();

List<Employee> employee = empDao.getEmpDetailsByName("bond");

In the unit test we proceed as follows:

EmpDao mockDao = EasyMock.createMock(EmpDao.class);

Employee mockEmp = new Employee();
mockEmp.setEmpName("bond");
mockEmp.setEmpCode("007");

List<Employee> empList = new ArrayList<Employee>(1);
empList.add(mockEmp);

expect(mockDao.getEmpDetails()).andReturn(empList);
expect(mockDao.getEmpDetailsByName("bond")).andReturn(empList);
replay(mockDao);

3. Registering/replaying expected methods using “EasyMock.replay”.

Once the behavior of the mock objects has been recorded with expectations, the mock objects must be switched to replay those expectations. We use the replay() method for this purpose. EasyMock stops recording expectations once this method is called.

EasyMock.replay(mockDao);

4. Verifying the expected methods using “EasyMock.verify”.

Verifying the mock expectations is the final step. This includes validating that all methods that were expected to be called were called, and that any calls that were not expected are also flagged.

EasyMock.verify(mock);

EasyMock provides more features, such as argument matchers, for additional unit testing flexibility. EasyMock was the first dynamic mock object generator, relieving users of hand-writing mock objects or generating code for them. It helps us increase our test coverage a lot.
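Putting the four steps together, here is a minimal end-to-end JUnit sketch. It reuses the EmpDao and Employee types from the snippets above and assumes a hypothetical EmpService that simply delegates to the DAO.

import static org.easymock.EasyMock.createMock;
import static org.easymock.EasyMock.expect;
import static org.easymock.EasyMock.replay;
import static org.easymock.EasyMock.verify;
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;

import org.junit.Test;

public class EmpServiceTest {

    @Test
    public void returnsEmployeesFetchedFromDao() {
        // Step 1: create the mock DAO
        EmpDao mockDao = createMock(EmpDao.class);

        Employee bond = new Employee();
        bond.setEmpName("bond");
        bond.setEmpCode("007");
        List<Employee> empList = Arrays.asList(bond);

        // Step 2: record the expected call and its return value
        expect(mockDao.getEmpDetailsByName("bond")).andReturn(empList);

        // Step 3: switch the mock into replay mode
        replay(mockDao);

        // Exercise the (hypothetical) class under test with the mock injected
        EmpService service = new EmpService(mockDao);
        assertEquals(1, service.findByName("bond").size());

        // Step 4: verify that every expected call actually happened
        verify(mockDao);
    }
}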

Analyze your Unit test coverage with Cobertura…

Writing unit test cases is now mandatory for developers in most companies. But at the same time it is very important to analyze the unit test coverage those tests actually achieve. Just writing some test cases does not make much sense on its own. Here comes the job of Cobertura.

Cobertura is an open source tool used to analyze the coverage of unit tests. It gives percentage values for line coverage as well as branch coverage.

Some features  are listed below.

  • Can be executed from ant or from the command line.
  • In the Maven case, we have the Cobertura Maven plugin.
  • Instruments Java bytecode after it has been compiled.
  • Can generate reports in HTML or XML.
  • Shows the percentage of lines and branches covered for each class, each package, and for the overall project.
  • Shows the McCabe cyclomatic code complexity of each class, and the average cyclomatic code complexity for each package, and for the overall product.
  • Can sort HTML results by class name, percent of lines covered, percent of branches covered, etc., in ascending or descending order.

Screenshots: (Cobertura coverage report)