SIX sparkling features of Apache Spark!

What is Apache Spark? Why there is a serious buzz going-on about this? If you are into BigData analytics business then, should you really care about Spark? Hope this post will help to answer some of these questions which might have coming to your mind these days.

Apache Spark is a powerful open source processing engine for Hadoop data built around speed, easy to use, and sophisticated analytics. It was originally developed in UC Berkeley’s AMPLab and later-on it moved to Apache. Apache Spark is basically a parallel data processing framework that can work with Apache Hadoop to make it extremely easy to develop fast, Big Data applications combining batch, streaming, and interactive analytics on all your data.

Lets go through some of its features which are really highlighting it in the Bigdata world!

  1. Lighting Fast Processing

When comes to BigData processing speed always matters. We always look for processing our huge data as fast as possible. Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes it possible by reducing number of read/write to disc. It stores this intermediate processing data in-memory. It uses the concept of an Resilient Distributed Dataset (RDD), which allows it to transparently store data on memory and persist it to disc only it’s needed. This helps to reduce most of the disc read and write –  the main time consuming factors – of data processing.

(Spark Performance over Hadoop. Image Courtesy: Cloudera. Visit this link to see how Jai & Matei explains the delightful experience giving by Spark to its developers.)

  1. Ease of Use as it supports multiple languages

Spark lets you quickly write applications in JavaScala, or Python. This helps developers to create and run their applications on their familiar programming languages. It comes with a built-in set of over 80 high-level operators.We can use it interactively to query data within the shell too.

  1. Support for Sophisticated Analytics

In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a single workflow.

  1. Real time stream processing

Spark can handle real time streaming. Map-reduce mainly handles and process the data stored already. However Spark can also manipulate data in real time using Spark Streaming. Not ignoring that there are other frameworks with their integration we can handle streaming in Hadoop.

Here is what Cloudera says about Sparks Streaming abilities:

  • Easy: Built on Spark’s lightweight yet powerful APIs, Spark Streaming lets you rapidly develop streaming applications
  • Fault tolerant: Unlike other streaming solutions (e.g. Storm), Spark Streaming recovers lost work and delivers exactly-once semantics out of the box with no extra code or configuration
  • Integrated: Reuse the same code for batch and stream processing, even joining streaming data to historical data

(Streaming Performance over Storm. Image

  1. Ability to integrate with Hadoop and existing HadoopData

Spark can run independently. Apart from that it can run on Hadoop 2’s YARN cluster manager, and can read any existing Hadoop data. That’s a BIG advantage! It can read from any Hadoop data sources for example HBase, HDFS etc. This feature of Spark makes it suitable for migration of existing pure Hadoop applications, if that application use-case is really suiting Spark. As Spark is using immutability more all scenarios might not be suitable for migration.

  1. Active and expanding Community

Apache Spark is built by a wide set of developers from over 50 companies. The project started in 2009 and as of now more than 250 developers have contributed to Spark already! It has active mailing lists and JIRA for issue tracking.

Below are some useful links to start with:

If you want to learn basics of Apache Spark then my previous post will help you. It has a training video link which explains Spark simple way.

Poll: Which JavaScript framework you will choose for your Single Page Application (SPA)?

A single-page application (SPA) is a web application or web site that fits on a single web page with the goal of providing a more easy user experience similar to a desktop application. We uses sophisticated JavaScript libraries to develop them. It also will use REST webservices to handle server side code. Below is a simple poll about the SPA.


Apache Spark: A promising framework for Big Data world!

Apache Spark™ is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. It is a fast and general engine for large-scale data processing. We can say it as an engine that increases the computing workloads Hadoop can handle. Also increasing the performance by using in-memory storage during execution. It is a standalone project, but it designed to work with/on top of the Hadoop Distributed File System.

The ecosystem of Spark projects. Source: Databricks

The ecosystem of Spark projects. Source: Databricks

Below is a very useful training video link for Spark beginners by Intellipaat (copy righted to Intellipaat and shared through their YouTube Channel). Hope that will help to get some initial idea about Spark for sure!

Step by Step Guide to create a sample CRUD Java application using MongoDB and Spring Data for MongoDB.

MongoDB is a scalable, high-performance, open source NoSQL database. Instead of storing data in tables as is done in a “classical” relational database, it stores structured data as JSON-like documents with dynamic schemas. This post contains steps to create a sample application using MongoDB and Spring Data for MongoDB.

Spring Data for MongoDB

‘Spring data for MongoDB’ is providing a familiar Spring-based programming model for NoSQL data stores. It provides many features to the Java developers and make their life more simpler while working with MongoDB. MongoTemplate helper class support increases productivity performing common Mongo operations. Includes integrated object mapping between documents and POJOs.  As usual it translates exception into Spring’s portable Data Access Exception hierarchy.  The Java based Query, Criteria, and Update DSLs  are very useful to code all in Java. It also provides a cross-store persistence – support for JPA Entities with fields transparently persisted/retrieved using MongoDB.

You can download it from here: Download

Installing Mongo DB in just 5 steps!

There is no other place on internet which explains more clearly than its official installation reference. Following are the steps which I followed when I did its installation.

1. Download the latest production release of MongoDB from the MongoDB downloads page.

2. Unzip it into any of your convenient location say like


3. MongoDB requires a data folder to store its files. The default location for the MongoDB data directory is C:\data\db. But we can create any folder location for storing data. I want to make it in the same MongoDB folder. So I have created a folder at the below path.


4. That’s it! Go to C:\mongodb\bin folder and run mongod.exe with the data path

C:\mongodb\bin\mongod.exe –dbpath C:\mongodb\data\db

If your path includes spaces, enclose the entire path in double quotations, for example:

C:\mongodb\bin\mongod.exe –dbpath “C:\mongodb\data\db storage place”


5. To start MongoDB, go to its bin folder and run mongo.exe. This mongo shell will connect to the database running on the localhost interface and port 27017 by default. If you want to run MongoDB as a windows service then please see it here.



Okay, this part is done. Let it run there. Now we can create a small Java application with Spring Data.

Creating an application with Spring Data (Another 5 more steps!)

We need below Jars for creating this sample project. As I am a nature lover and a GoGreen person I named the project as “NatureStore”! Using this we are going to “Save” some “Trees” in to the DB!

Step1: Create a simple domain object.

The @Document annotation identifies a domain object that is going to be persisted to MongoDB.  And the  @Id annotation identifies its id.

package com.orangeslate.naturestore.domain;


public class Tree {

	private String id;

	private String name;

	private String category;

	private int age;

	public Tree(String id, String name, int age) { = id; = name;
		this.age = age;

	public String getId() {
		return id;

	public void setId(String id) { = id;

	public String getName() {
		return name;

	public void setName(String name) { = name;

	public String getCategory() {
		return category;

	public void setCategory(String category) {
		this.category = category;

	public int getAge() {
		return age;

	public void setAge(int age) {
		this.age = age;

	public String toString() {
		return "Person [id=" + id + ", name=" + name + ", age=" + age
				+ ", category=" + category + "]";

Step2: Create a simple Interface.

Created a simple interface with CRUD methods. I have also includes createColletions and dropCollections into this same interface.

package com.orangeslate.naturestore.repository;

import java.util.List;

import com.mongodb.WriteResult;

public interface Repository<T> {

	public List<T> getAllObjects();

	public void saveObject(T object);

	public T getObject(String id);

	public WriteResult updateObject(String id, String name);

	public void deleteObject(String id);

	public void createCollection();

	public void dropCollection();

Step 3: Create an implementation class specifically for Tree domain object. It also initializes the MongoDB Collections.

package com.orangeslate.naturestore.repository;

import java.util.List;


import com.mongodb.WriteResult;
import com.orangeslate.naturestore.domain.Tree;

public class NatureRepositoryImpl implements Repository<Tree> {

	MongoTemplate mongoTemplate;

	public void setMongoTemplate(MongoTemplate mongoTemplate) {
		this.mongoTemplate = mongoTemplate;

	 * Get all trees.
	public List<Tree> getAllObjects() {
		return mongoTemplate.findAll(Tree.class);

	 * Saves a {@link Tree}.
	public void saveObject(Tree tree) {

	 * Gets a {@link Tree} for a particular id.
	public Tree getObject(String id) {
		return mongoTemplate.findOne(new Query(Criteria.where("id").is(id)),

	 * Updates a {@link Tree} name for a particular id.
	public WriteResult updateObject(String id, String name) {
		return mongoTemplate.updateFirst(
				new Query(Criteria.where("id").is(id)),
				Update.update("name", name), Tree.class);

	 * Delete a {@link Tree} for a particular id.
	public void deleteObject(String id) {
				.remove(new Query(Criteria.where("id").is(id)), Tree.class);

	 * Create a {@link Tree} collection if the collection does not already
	 * exists
	public void createCollection() {
		if (!mongoTemplate.collectionExists(Tree.class)) {

	 * Drops the {@link Tree} collection if the collection does already exists
	public void dropCollection() {
		if (mongoTemplate.collectionExists(Tree.class)) {

Step 4: Creating Spring context.

Declare all the spring beans and mongodb objects in Spring context file. Lets call it as applicationContext.xml. Note we are creating not created a database with name “nature” yet. MongoDB will create it once we saves our first data.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns=""
	xmlns:xsi="" xmlns:context=""

	<bean id="natureRepository"
		<property name="mongoTemplate" ref="mongoTemplate" />

	<bean id="mongoTemplate" class="">
		<constructor-arg name="mongo" ref="mongo" />
		<constructor-arg name="databaseName" value="nature" />

	<!-- Factory bean that creates the Mongo instance -->
	<bean id="mongo" class="">
		<property name="host" value="localhost" />
		<property name="port" value="27017" />

	<!-- Activate annotation configured components -->
	<context:annotation-config />

	<!-- Scan components for annotations within the configured package -->
	<context:component-scan base-package="com.orangeslate.naturestore">
		<context:exclude-filter type="annotation"
			expression="org.springframework.context.annotation.Configuration" />


Step 5: Creating a Test class

Here I have created a simple test class and initializing context inside using ClassPathXmlApplicationContext.

package com.orangeslate.naturestore.test;

import org.springframework.context.ConfigurableApplicationContext;

import com.orangeslate.naturestore.domain.Tree;
import com.orangeslate.naturestore.repository.NatureRepositoryImpl;
import com.orangeslate.naturestore.repository.Repository;

public class MongoTest {

	public static void main(String[] args) {

		ConfigurableApplicationContext context = new ClassPathXmlApplicationContext(

		Repository repository = context.getBean(NatureRepositoryImpl.class);

		// cleanup collection before insertion

		// create collection

		repository.saveObject(new Tree("1", "Apple Tree", 10));

		System.out.println("1. " + repository.getAllObjects());

		repository.saveObject(new Tree("2", "Orange Tree", 3));

		System.out.println("2. " + repository.getAllObjects());

		System.out.println("Tree with id 1" + repository.getObject("1"));

		repository.updateObject("1", "Peach Tree");

		System.out.println("3. " + repository.getAllObjects());


		System.out.println("4. " + repository.getAllObjects());

Lets run it as Java application. We can see the below output. First method saves “Apple Tree” into the database. Second method saves “OrangeTree” also into the database. Third method demonstrates finding an object with its id. Fourth one updates an existing object name with “Peach Tree”. And at last; the last method deletes the second object from DB.

1. [Person [id=1, name=Apple Tree, age=10, category=null]]
2. [Person [id=1, name=Apple Tree, age=10, category=null], Person [id=2, name=Orange Tree, age=3, category=null]]
Tree with id 1Person [id=1, name=Apple Tree, age=10, category=null]
3. [Person [id=1, name=Peach Tree, age=10, category=null], Person [id=2, name=Orange Tree, age=3, category=null]]
4. [Person [id=1, name=Peach Tree, age=10, category=null]]

NOTE: You can download all this code from Github!