Monday, November 21, 2011

Cassandra Range Query Using CompositeType

CompositeType is a powerful technique for building indices with regular column families instead of super column families. But there is a dearth of information on how to use CompositeType in Cassandra. Introduced in 0.8.1 in May 2011, it is a relative newcomer to Cassandra. It doesn't help that it is not even in the "official" datatype documentation for Cassandra 1.0 and 0.8! This article pieces together various tidbits to bring you a complete how-to guide on programming CompositeType. The code examples use Hector.

Let's say we want to define a column family as follows:
row key: string
column key: composite of an integer and a string
column value: string

We can define the following schema on the cli:

create column family MyCF
    with comparator = 'CompositeType(IntegerType,UTF8Type)'
    and key_validation_class = 'UTF8Type'
    and default_validation_class = 'UTF8Type';

We can also define the same schema programmatically in Hector:

// Step 1: Create a cluster
CassandraHostConfigurator chc 
      = new CassandraHostConfigurator("localhost");
Cluster cluster = HFactory.getOrCreateCluster(
                        "Test Cluster", chc);

// Step 2: Create the schema
ColumnFamilyDefinition myCfd 
      = HFactory.createColumnFamilyDefinition(
            "MyKS", "MyCF", ComparatorType.COMPOSITETYPE);
// Thanks to Shane Perry for this tip.
// http://groups.google.com/group/hector-users/
//       browse_thread/thread/ffd0895a17c7b43e
myCfd.setComparatorTypeAlias("(IntegerType, UTF8Type)");
myCfd.setKeyValidationClass(UTF8Type.class.getName());
myCfd.setDefaultValidationClass(UTF8Type.class.getName());
KeyspaceDefinition myKs = HFactory.createKeyspaceDefinition(
      "MyKS", ThriftKsDef.DEF_STRATEGY_CLASS, 1, 
      Arrays.asList(myCfd));

// Step 3: Add schema to the cluster
cluster.addKeyspace(myKs, true);
Keyspace ks = HFactory.createKeyspace("MyKS", cluster);

Now let's insert a single row with 2 columns:

String rowKey = "row1";

// First column key
Composite colKey1 = new Composite();
colKey1.addComponent(1, IntegerSerializer.get());
colKey1.addComponent("c1", StringSerializer.get());

// Second column key
Composite colKey2 = new Composite();
colKey2.addComponent(2, IntegerSerializer.get());
colKey2.addComponent("c2", StringSerializer.get());

// Insert both columns into row1 at once
// The row key is a string, so the mutator needs a StringSerializer
Mutator<String> m 
      = HFactory.createMutator(ks, StringSerializer.get());
m.addInsertion(rowKey, "MyCF", 
      HFactory.createColumn(colKey1, "foo", 
                            new CompositeSerializer(), 
                            StringSerializer.get()));
m.addInsertion(rowKey, "MyCF", 
      HFactory.createColumn(colKey2, "bar", 
                            new CompositeSerializer(), 
                            StringSerializer.get()));
m.execute();

After the insertion, the column family should look like this table:

        {1, c1}    {2, c2}
row1    foo        bar

Now let's retrieve the first column using a slice query on only the first (integer) component of the composite column key. Since Cassandra compares composite keys component by component, in order, we can construct a search range from {0, "a"} to {1, "\uFFFF"} which will include {1, "c1"} but not {2, "c2"}.

SliceQuery<String, Composite, String> sq 
      =  HFactory.createSliceQuery(ks, StringSerializer.get(), 
                                   new CompositeSerializer(), 
                                   StringSerializer.get());
sq.setColumnFamily("MyCF");
sq.setKey("row1");

// Create a composite search range
Composite start = new Composite();
start.addComponent(0, IntegerSerializer.get());
start.addComponent("a", StringSerializer.get());
Composite finish = new Composite();
finish.addComponent(1, IntegerSerializer.get());
finish.addComponent(Character.toString(Character.MAX_VALUE), 
                    StringSerializer.get());
sq.setRange(start, finish, false, 100);

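// Now search and read back the matching columns.
// With the data above this should yield only {1, "c1"}.
QueryResult<ColumnSlice<Composite, String>> result = sq.execute();
for (HColumn<Composite, String> col 
      : result.get().getColumns()) {
    System.out.println(col.getValue());
}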
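
The ordering argument behind this range can be checked without a running cluster. The following is a minimal sketch in plain Java, not Hector code: Key, ORDER, sampleRow and slice are hypothetical names introduced for illustration, and a TreeMap stands in for a row whose columns are kept sorted by the composite comparator. It also assumes that Java's String.compareTo is a close enough stand-in for UTF8Type's byte-wise comparison, which holds for the ASCII names used here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class CompositeRangeSketch {

    // Hypothetical stand-in for a composite column key of (integer, string).
    public static final class Key {
        final int num;
        final String text;

        public Key(int num, String text) {
            this.num = num;
            this.text = text;
        }
    }

    // Compare component by component: the integer first, then the string.
    public static final Comparator<Key> ORDER = new Comparator<Key>() {
        @Override
        public int compare(Key a, Key b) {
            int byNum = Integer.compare(a.num, b.num);
            return byNum != 0 ? byNum : a.text.compareTo(b.text);
        }
    };

    // The row from the article, with columns kept sorted by ORDER.
    public static NavigableMap<Key, String> sampleRow() {
        NavigableMap<Key, String> row = new TreeMap<Key, String>(ORDER);
        row.put(new Key(1, "c1"), "foo");
        row.put(new Key(2, "c2"), "bar");
        return row;
    }

    // Values of all columns whose key lies between start and finish, inclusive.
    public static List<String> slice(NavigableMap<Key, String> row,
                                     Key start, Key finish) {
        return new ArrayList<String>(
                row.subMap(start, true, finish, true).values());
    }

    public static void main(String[] args) {
        List<String> hits = slice(sampleRow(),
                new Key(0, "a"), new Key(1, "\uFFFF"));
        System.out.println(hits); // prints [foo]
    }
}
```

Widening the finish key to {2, "\uFFFF"} in this sketch returns both columns, which is the same effect you would expect from widening the range in the slice query.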

It is unfortunate that a JavaDoc typo in the Cassandra source code prevents tools like Eclipse from displaying documentation about CompositeType. But you can always view the source online to get the precise definition and encoding scheme of CompositeType. Reading source code has been and is still the best way of learning new features in Cassandra.
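
As a rough illustration of that encoding scheme: the CompositeType source describes each component as a 2-byte length, followed by the value bytes, followed by one "end-of-component" byte that is 0 for a stored column name. The plain-Java sketch below builds those bytes by hand for the {1, "c1"} key. The class and method names are hypothetical, and writing the integer as 4 big-endian bytes is an assumption about what the client serializer produces, so check the source for your Cassandra and Hector versions before relying on it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CompositeEncodingSketch {

    // Append one component: 2-byte big-endian length, value bytes,
    // then a single end-of-component byte (0 for a stored column name).
    static void writeComponent(ByteArrayOutputStream out, byte[] value) {
        out.write((value.length >> 8) & 0xFF);
        out.write(value.length & 0xFF);
        out.write(value, 0, value.length);
        out.write(0);
    }

    // Encode an (integer, string) composite. Assumes the integer is
    // written as 4 big-endian bytes and the string as UTF-8.
    public static byte[] encode(int first, String second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeComponent(out, ByteBuffer.allocate(4).putInt(first).array());
        writeComponent(out, second.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode(1, "c1")) {
            hex.append(String.format("%02x ", b));
        }
        // prints 00 04 00 00 00 01 00 00 02 63 31 00
        System.out.println(hex.toString().trim());
    }
}
```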

7 comments:

  1. After // Step 3: Add schema to the cluster,
    if I need to add another super column family to this schema, how is that possible?

  2. Super column families are frowned upon these days. The suggestion I got from experienced Cassandra devs is to use CompositeType instead. Someone even told me that there were discussions to deprecate super column families in a future Cassandra release. Ed Anuff did a great post on composite vs super-family here:
    http://www.anuff.com/2011/02/indexing-in-cassandra.html

  3. If I use CompositeType for row keys, as in key_validation_class = 'CompositeType(UTF8Type, UTF8Type)' with RandomPartitioner, how do I perform range queries where only the first component is known and I have to fetch all the rows matching the first row? The second component can be anything. When I set only the first component for both the start and end keys in the RangeSliceQuery.setKeys method, I don't get anything back.

    Replies
    1. I meant matching the first component, not the first row.

    2. I have the same issue... Has anyone found a solution? Can you please share?

    3. RandomPartitioner orders row keys by their MD5 values, not by their natural ordering. So no, there is no easy way to do a range scan on rows "partitioned" by RandomPartitioner. If you truly need a row-based range scan, try OrderPreservingPartitioner. But be aware of hot spots when using any non-RandomPartitioner partitioner!

  4. How can I make this work with TimeUUIDType instead of IntegerType in the composite key? What should I do if I don't want to include the TimeUUID in the range query? It only works if I put the TimeUUID as the second part of the composite key.
