Cassandra DataStax – Developer Guide with Spring Data Cassandra


We will discus Cassandra implementation :

Important Points:

Download and Installation:

  1. Tarball Installation

  1. DataStax DB
    • You need to register yourself with DataStax for download.
    • DataStax Enterprise – http://www.datastax.com/download#dl-enterprise.
    • Create These folders and changed permission:
    • sudo chmod 777 /var/log/cassandra
      sudo mkdir -p /var/lib/cassandra/data
      sudo chmod 777 /var/lib/cassandra/data
      sudo mkdir -p /var/lib/cassandra/commitlog
      sudo chmod 777 /var/lib/cassandra/commitlog
      sudo mkdir -p /var/lib/cassandra/saved_caches
      sudo chmod 777 /var/lib/cassandra/saved_caches
    • How to run Cassandra: Go to DataStax Cassndra installed folder on Mac/Linux/Unix env:                                                                                          
                                                                                                                                                                        cd /Users/<userName>/dse-4.5.2/bin</pre>
      <pre>sudo ./dse cassandra -f
       
      //This above command Cassandra DB on your local system. Hit enter to quit from ruining server in background and start CQL query console.
      sudo ./cqlsh
      
  • Create Schema:
    CREATE SCHEMA event_owner WITH replication = {‘class’: ‘SimpleStrategy’, ‘replication_factor’ : 1 };
    1. Schema(Keyspace) name: event_owner
    2. Table name: event_audit
    Table creation syntax:Please find the revised data model and details below.Note: I have used expanded names for easier understanding which can be shortened later on.
    CREATE TABLE  event_owner.event_audit (
    ctg    text,
    month       timestamp,
    ctgid      text,
    ver     timeuuid,
    userid      text,
    action      text,
    content     text,
    PRIMARY KEY ((category,month),cat_id,version)
    ) WITH CLUSTERING ORDER BY (cat_id ASC, version DESC);
    Sample Data
  • category | month                    | cat_id | version                              | action | content      | userid
    ———-+————————–+——–+————————————–+——–+————–+———–
    CC | 2014-01-01 05:30:00+0530 |   8000 | b3fc48e0-5608-11e4-aacd-e700f669bcfc |  DRAFT | json content | 155045940
    CC | 2014-01-01 05:30:00+0530 |   9000 | a4747460-5608-11e4-aacd-e700f669bcfc |  DRAFT | json content | 155045940
    Description

    • category
      • Commitcode / Part Association/ EventTag [CC / PA / ET]
    • month
      • 12 AM timestamp of first day of the month the change is made
    • cat_id 
      • Unique id for a particular category [Say inc axe of commit code it is cc_id say 9000]
    • version 
      • it is the unique id that indicates the version number. 
      • You can populate it using now() function. 
      • It has an embedded timestamp that can be used to know the timestamp. Use dateOf() function to get the timestamp value.
    • userid
      • id of the user who made the change
    • action 
      • SAVEDRAFT/ PUBLISH / DELETE etc.
    • content 
      • actual json content after the change [full json]

  • Sample Query to access the data
    cqlsh:cdb> select * from audit where category=’CC’ and month=’2014-01-01 05:30:00+0530′ and cat_id=’9000′;
    category | month                    | cat_id | version                              | action | content      | userid
    ———-+————————–+——–+————————————–+——–+————–+———–
    CC | 2014-01-01 05:30:00+0530 |   9000 | a4747460-5608-11e4-aacd-e700f669bcfc |  DRAFT | json content | 155045940
  • SQL Like Commands: You can use same standard SQL DDL/DML commands/syntax for Cassandra query, like-
    • Drop table 
    • Update
    • Delete
    • Truncate
    • Select query

    2. DataStax OpsCenter- 

http://www.datastax.com/what-we-offer/products-services/datastax-opscenter

    3. DataStax DevCenter

        Installation: DataStax DevCenter is a visual CQL query tool for Cassandra and DataStax Enterprise.

        How to start OpsCenter GUI: http://www.datastax.com/documentation/getting_started/doc/getting_started/gettingStartedInstallOpscTar_t.html

2. Package Installation:

DataStax All-in-One Installer

How to run DataStax?

  • Cassandra:
    • Go to your /Users//dse via Terminal and execute the following command:
      • sudo ./bin/dse cassandra -f (This will start the cassandra)
  • Opscenter and DataStax-Agent:
    • Go to /Users//dse in new Tab and execute command : 
      • sudo ./opscenter/bin/opscenter -f (This will start the opscenter)
    • Go to /Users//dse in new Tab and execute command :
      • sudo ./datastax-agent/bin/datastax-agent -f (This will start the datastax-agent)

Using DataStax:

  • Now you can see the Opscenter in Browser in following address:
    • localhost:8888/
  • Here You can able to see the your Cassandra cluster visually, (Kind of monitoring tool for cassandra)

 

Integrate Cassandra with Spring Data Cassandra :

  1. /**
     * Created by: Rajiv Srivastava
     */
    @Configuration
    @ComponentScan(basePackages = {com.cassandraproject.dao,com.cassandraproject.utils})
    @EnableCassandraRepositories(basePackages ={com.cassandraproject.repository})
    public class AuditCoreContextConfig extends AbstractCassandraConfiguration {
    
     @Override
     protected String getKeyspaceName() {
     return event_owner; //Schema or Keyspace name
     }
    
     @Override
     protected String getContactPoints() {
     return localhost;//IP address of server/local machine. Host of a clusters can be separated with comma (,) like host1,host2. Also minimum two host should be added, so that second Cassandra server will be connected if first is down.
     }
    
     @Override
     protected int getPort() {
     return 9042; //Cassandra DB port
     }
    }
    

2. Data Modelling

 a. Primary/Clustered/Partioned key


/**
 * Created by: Rajiv Srivastava
 */

/* Keyspac/Schema- event_owner
 * CREATE TABLE event_owner.event_audit (
 ctg text,
 month timestamp,
 ctgid text,
 ver timeuuid,
 userid text,
 action text,
 content text,
 PRIMARY KEY ((ctg,month),ctgid,ver)
 )WITH CLUSTERING ORDER BY (ctgid ASC, ver DESC);

*/

@PrimaryKeyClass
public class EventAuditKey implements Serializable {

private static final long serialVersionUID = 1L;

@PrimaryKeyColumn(name = ctg, ordinal = 0, type = PrimaryKeyType.PARTITIONED)
 private String category;

 @PrimaryKeyColumn(name = month, ordinal = 1, type = PrimaryKeyType.PARTITIONED)
 private Date month;

 @PrimaryKeyColumn(name = ctgid, ordinal = 2, type = PrimaryKeyType.CLUSTERED, ordering =Ordering.ASCENDING)
 private String categoryId;

 @PrimaryKeyColumn(name = ver, ordinal = 3, type = PrimaryKeyType.CLUSTERED, ordering = Ordering.DESCENDING)
 private UUID version;

 3. Repository


/**
 * Created by: Rajiv Srivastava*/
@Repository
public interface AuditRepository extends CrudRepository <EventAudit, EventAuditKey> {

	@Query(select * from event_owner.event_audit)
	public List<EventAudit&amp> eventAudit();
}

4. CRUD Operation using JPA/CrudRepository – DAO Layer


/**
 * Created by: Rajiv Srivastava*/
@Component
public class EventDaoImpl implements EventAuditDao {

 @Autowired
 public AuditRepository auditRepository;

 @Override
 public void save(EventAudit entity) {
 auditRepository.save(entity);

 }

 @Override
 public void save(Collection<EventAudit> entities) {
 auditRepository.save(entities);
 }

 @Override
 public EventAudit find(EventAuditKey eventAuditKey) {
 if(null==eventAuditKey){
 throw new IllegalArgumentException(It doesn't has all required instance variable set);
 }
 return auditRepository.findOne(eventAuditKey);
 }

 @Override
 public List<EventAudit> getAll() {

 Iterable<EventAudit> iterable=auditRepository.findAll();
 if(null != iterable.iterator()){
 return Lists.newArrayList(iterable.iterator());
 }
 return new ArrayList<>();
 }

 @Override
 public List<EventAudit> getListEventAuditMonthCategoryWise(Date date, String Category) {
 // TODO Auto-generated method stub
 return null;
 }
}

Apache Cassandra: 

  * Getting started: http://wiki.apache.org/cassandra/GettingStarted

  * Join us in #cassandra on irc.freenode.net and ask questions

  * Subscribe to the Users mailing list by sending a mail to

    user-subscribe@cassandra.apache.org

  * Planet Cassandra aggregates Cassandra articles and news:

    http://planetcassandra.org/

For more on what commands are supported by CQL, see

https://github.com/apache/cassandra/blob/trunk/doc/cql3/CQL.textile.  A

reasonable way to think of it is as, “SQL minus joins and subqueries.”

Advertisements

MongoDB + Java + Maven Sample Application


Prerequisite: Download and Install MongoDB (http://docs.mongodb.org/manual/installation/) on your local machine and run two instances in two separate terminal/command prompt. This sample app is developed on Mac OS/Linux env.

1. Start MongoDB server using : ./mongod

2. Start MongoDB client : ./mongo

Add MongoDB Jar dependency in your project:

 <dependency>
	<groupId>org.mongodb</groupId>
	<artifactId>mongo-java-driver</artifactId>
	<version>2.10.1</version>
 </dependency>

 

Sample MONGODB CRUD code:

package com.db;
import java.util.List;
import java.util.Set;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class MongoConnectionManager {
	static DBCursor cursor;

	public static void main(String[] args) {

	try{
		System.out.println("Connecting with MongoDB......");
		//DB connection after ver 2.10.0
		MongoClient mongoconn = new MongoClient("localhost",27017 );
		//Get database
		DB db = mongoconn.getDB( "test" );
		System.out.println("Database Name: "+db.getName());

		//Display all databases
		List<String> dbs = mongoconn.getDatabaseNames();
		for(String mydb : dbs){
			System.out.println(mydb);
		}

		//If MongoDB is installed on Secure mode
		//boolean auth = db.authenticate("username", "password".toCharArray());

		//Get Collection/tables names
		Set<String> colls = db.getCollectionNames();

		for (String s : colls) {
		    System.out.println(s);
		}

		//Get Collection

		DBCollection coll = db.getCollection("mynames");
		coll.createIndex(new BasicDBObject("email", 1));  // create index on "i", ascending

		System.out.printf("Collection Names:"+db.getCollectionNames());

		//Drop collections
		coll.drop(); 

		//Insert new records/documents
		BasicDBObject doc1 = new BasicDBObject();

		//Insert values into document/record
        doc1.put("name", "Rajiv");
        doc1.put("email", "rajivmca2004@yahoo.com");
        doc1.put("count",1);

     // Insert values in Collection
        coll.insert(doc1);

        BasicDBObject doc2 = new BasicDBObject();

		//Insert values into document/record
        doc2.put("name", "Mohit");
        doc2.put("email", "mohit@yahoo.com");
        doc2.put("count",2);

       // Insert values in Collection
        coll.insert(doc2);

        //Fecth only first doc

        DBObject mydoc= coll.findOne();
        System.out.println("Only First Doc :"+mydoc);

        //Insert and fetch Multiple doc

        for (int i = 0; i < 10; i++) {
			coll.insert(new BasicDBObject().append("count", i));
		}

        //Total documents
       cursor= coll.find();

        while(cursor.hasNext()){
        	System.out.println(cursor.next());
        }

        //Getting A Single Document with A Query
        BasicDBObject query= new BasicDBObject();
        query.put("name","Rajiv");
        cursor= coll.find(query);
        System.out.print("Getting A Single Document with A Query: \n");

        //Iterate over database results
        while(cursor.hasNext()){
        	System.out.println(cursor.next());
        }

        //Conditional Queries, documents where i > 50. We could also get a range, say 20 < i <= 30 :
        query = new BasicDBObject();
        query.put("count", new BasicDBObject("$gt", 1).append("$lte", 30));  // i.e.   20 < i <= 30

        cursor = coll.find(query);

        while(cursor.hasNext()) {
            System.out.println("Comparison =>"+cursor.next());
        }
        System.out.println("Total documents: "+coll.getCount());

		} catch (Exception e) {
			e.printStackTrace();
		}

	}

}

Optimizing SQL Queries


 I have found a nice article on Oracle query optimzation:
  D.1 Optimizing Single-Table Queries
To improve the performance of a query that selects rows of a table based on a specific column value, create an index on that column. For example, the following query performs better if the NAME column of the EMP table has an index.

SELECT * 
FROM EMP 
WHERE NAME = 'Smith'; 

D.2 Optimizing Join Queries

The following can improve the performance of a join query (a query with more than one table reference in the FROM clause).

D.2.1 Create an Index on the Join Column(s) of the Inner Table

In the following example, the inner table of the join query is DEPT and the join column of DEPT is DEPT#. An index on DEPT.DEPT# improves the performance of the query. In this example, since DEPT# is the primary key of DEPT, an index is implicitly created for it. The optimizer will detect the presence of the index and decide to use DEPT as the inner table. In case there is also an index on EMP.WORKS_IN column the optimizer evaluates the cost of both orders of execution;DEPT followed by EMP (where EMP is the inner table) and EMP followed by DEPT (where DEPT is the inner table) and picks the least expensive execution plan.

SELECT e.SS#, e.NAME, d.BUDGET
FROM EMP e, DEPT d
WHERE e.WORKS_IN = DEPT.DEPT#
AND e.JOB_TITLE = 'Manager';

D.2.2 Bypassing the Query Optimizer

Normally optimizer picks the best execution plan, an optimal order of tables to be joined. In case the optimizer is not producing a good execution plan you can control the order of execution using the HINTS feature SQL. For more information see the Oracle9i Lite SQL Reference.
For example, if you want to select the name of each department along with the name of its manager, you can write the query in one of two ways. In the first example which follows, the hint /++ordered++/ says to do the join in the order the tables appear in the FROM clause with attempting to optimize the join order.


SELECT /++ordered++/ d.NAME, e.NAME
FROM DEPT d, EMP e
WHERE d.MGR = e.SS#

or:


SELECT /++ordered++/ d.NAME, e.NAME
FROM EMP e, DEPT d
WHERE d.MGR = e.SS#

Suppose that there are 10 departments and 1000 employees, and that the inner table in each query has an index on the join column. In the first query, the first table produces 10 qualifying rows (in this case, the whole table). In the second query, the first table produces 1000 qualifying rows. The first query will access the EMP table 10 times and scan the DEPT table once. The second query will scan the EMP table once but will access the DEPT table 1000 times. Therefore the first query will perform much better. As a rule of thumb, tables should be arranged from smallest effective number rows to largest effective number of rows. The effective row size of a table in a query is obtained by applying the logical conditions that are resolved entirely on that table.

In another example, consider a query to retrieve the social security numbers and names of employees in a given location, such as New York. According to the sample schema, the query would have three table references in the FROM clause. The three tables could be ordered in six different ways. Although the result is the same regardless of which order you choose, the performance could be quite different.

Suppose the effective row size of the LOCATION table is small, for example select count(*) from LOCATION where LOC_NAME = 'New York' is a small set. Based on the above rules, the LOCATION table should be the first table in the FROM clause. There should be an index on LOCATION.LOC_NAME. Since LOCATION must be joined with DEPT, DEPT should be the second table and there should be an index on the LOC column of DEPT. Similarly, the third table should be EMP and there should be an index on EMP#. You could write this query as:

SELECT /++ordered++/ e.SS#, e.NAME
FROM LOCATION l, DEPT d, EMP e
WHERE l.LOC_NAME = 'New York' AND
l.LOC# = d.LOC AND
d.DEPT# = e.WORKS_IN;

D.3 Optimizing with Order By and Group By Clauses

Various performance improvements have been made so that SELECT statements run faster and consume less memory cache. Group by and Order by clauses attempt to avoid sorting if a suitable index is available.

D.3.1 IN subquery conversion

Converts IN subquery to a join when the select list in the subquery is uniquely indexed.
For example, the following IN subquery statement is converted to its corresponding join statement. This assumes that c1 is the primary key of table t2:


SELECT c2 FROM t1 WHERE
c2 IN (SELECT c1 FROM t2);

becomes:


SELECT c2 FROM t1, t2 WHERE t1.c2 = t2.c1;

D.3.2 ORDER BY optimization with no GROUP BY

This eliminates the sorting step for an ORDER BY clause in a select statement if ALL of the following conditions are met:

  1. All ORDER BY columns are in ascending order or in descending order.
  2. Only columns appear in the ORDER BY clause. That is, no expressions are used in the ORDER BY clause.
  3. ORDER BY columns are a prefix of some base table index.
  4. The cost of accessing by the index is less than sorting the result set.

D.3.3 GROUP BY optimization with no ORDER BY

This eliminates the sorting step for the grouping operation if GROUP BY columns are the prefix of some base table index.

D.3.4 ORDER BY optimization with GROUP BY

When ORDER BY columns are the prefix of GROUP BY columns, and all columns are sorted in either ascending or in descending order, the sorting step for the query result is eliminated. If GROUP BY columns are the prefix of a base table index, the sorting step in the grouping operation is also eliminated.

D.3.5 Cache subquery results

If the optimizer determines that the number of rows returned by a subquery is small and the query is non-correlated, then the query result will be cached in memory for better performance. Currently the number of rows is set at 2000. For example:


select * from t1 where
t1.c1 = (select sum(salary)
from t2 where t2.deptno = 100);

Difference between MongoDB and RDBMS


Why RDBMS (SQL):

  • Frequent CRUD transactions for a limited similar type of data in GBs.
  • It’s good for structured data.
  • This kind of database is tightly structured with schema and perform slower (low latency) with huge growing data.
  • RDBMS performs faster for low amount of data ( in GBs).
  • SQL DBs- Oracle, MySQL, SQLServer etc.

Why MongoDB (No-SQL):

  • Write once and read many for unstructured data.
  • It’s faster than RDBMS for growing data on clusters/cloud in TB, PB etc.
  • If there is a requirement to not update DB frequently (not mission critical), dissimilar data , then go for this.
  • No SQL DBs- MongoDB, Cassandra, NeoJ, CouchDB, Hadoop, Cloudera, MapR etc.

In nutshell, if you have simple data without any complex relationship then why to choose complex RDBMS. Both are not replacement of each other, both have their own importance. I think it will be helpful to understand use cases of these two databases.