Randomized Sort: November 2011

Monday, November 21, 2011

Cassandra Range Query Using CompositeType

CompositeType is a powerful technique for building indices with regular column families instead of super column families. But there is a dearth of information on how to use CompositeType in Cassandra. Introduced in 0.8.1 in May 2011, it is a relative newcomer to Cassandra. It doesn't help that it is not even in the "official" datatype documentation for Cassandra 1.0 and 0.8! This article pieces together various tidbits to bring you a complete how-to guide on programming CompositeType. The code examples use Hector.

Let's say we want to define a column family as the following:
row key: string
column key: composite of an integer and a string
column value: string

We can define the following schema on the cli:

create column family MyCF
    with comparator = 'CompositeType(IntegerType,UTF8Type)'
    and key_validation_class = 'UTF8Type'
    and default_validation_class = 'UTF8Type';

We can also define the same schema programmatically in Hector:

// Step 1: Create a cluster
CassandraHostConfigurator chc 
      = new CassandraHostConfigurator("localhost");
Cluster cluster = HFactory.getOrCreateCluster(
                        "Test Cluster", chc);

// Step 2: Create the schema
ColumnFamilyDefinition myCfd 
      = HFactory.createColumnFamilyDefinition(
            "MyKS", "MyCF", ComparatorType.COMPOSITETYPE);
// Thanks to Shane Perry for this tip:
// http://groups.google.com/group/hector-users/
//       browse_thread/thread/ffd0895a17c7b43e
myCfd.setComparatorTypeAlias("(IntegerType, UTF8Type)");
myCfd.setKeyValidationClass(UTF8Type.class.getName());
myCfd.setDefaultValidationClass(UTF8Type.class.getName());
KeyspaceDefinition myKs = HFactory.createKeyspaceDefinition(
      "MyKS", ThriftKsDef.DEF_STRATEGY_CLASS, 1, 
      Arrays.asList(myCfd));

// Step 3: Add schema to the cluster
cluster.addKeyspace(myKs, true);
// createKeyspace takes the keyspace name, not the definition
Keyspace ks = HFactory.createKeyspace("MyKS", cluster);

Now let's insert a single row with 2 columns:

String rowKey = "row1";

// First column key
Composite colKey1 = new Composite();
colKey1.addComponent(1, IntegerSerializer.get());
colKey1.addComponent("c1", StringSerializer.get());

// Second column key
Composite colKey2 = new Composite();
colKey2.addComponent(2, IntegerSerializer.get());
colKey2.addComponent("c2", StringSerializer.get());

// Insert both columns into row1 at once
// The row key is a string, so the mutator needs a StringSerializer
Mutator<String> m 
      = HFactory.createMutator(ks, StringSerializer.get());
m.addInsertion(rowKey, "MyCF", 
      HFactory.createColumn(colKey1, "foo", 
                            new CompositeSerializer(), 
                            StringSerializer.get()));
m.addInsertion(rowKey, "MyCF", 
      HFactory.createColumn(colKey2, "bar", 
                            new CompositeSerializer(), 
                            StringSerializer.get()));
m.execute();

After the insertion, the column family should look like this table:

row key	{1, c1}	{2, c2}
row1	foo	bar

Now let's retrieve the first column using a slice query that constrains only the first (integer) component of the composite column key. Since Cassandra orders composite keys component by component, we can construct a search range from {0, "a"} to {1, "\uFFFF"}, which includes {1, "c1"} but not {2, "c2"}.

SliceQuery<String, Composite, String> sq 
      =  HFactory.createSliceQuery(ks, StringSerializer.get(), 
                                   new CompositeSerializer(), 
                                   StringSerializer.get());
sq.setColumnFamily("MyCF");
sq.setKey("row1");

// Create a composite search range
Composite start = new Composite();
start.addComponent(0, IntegerSerializer.get());
start.addComponent("a", StringSerializer.get());
Composite finish = new Composite();
finish.addComponent(1, IntegerSerializer.get());
finish.addComponent(Character.toString(Character.MAX_VALUE), 
                    StringSerializer.get());
sq.setRange(start, finish, false, 100);

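// Now search and print each matching column.
QueryResult<ColumnSlice<Composite, String>> result = sq.execute();
for (HColumn<Composite, String> col : result.get().getColumns()) {
    Integer num = col.getName().get(0, IntegerSerializer.get());
    System.out.println(num + " = " + col.getValue());
}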
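
The range above works because composite keys are compared one component at a time: the second component only matters when the first components are equal. Here is a minimal sketch of that ordering in plain Java, with no Cassandra or Hector involved. The Key class and inRange helper are hypothetical names made up for illustration, and the string comparison uses Java's String.compareTo, which is assumed to agree with UTF8Type's byte ordering for the simple values used here.

```java
class CompositeOrderSketch {
    // Hypothetical stand-in for a two-component column key (integer, string).
    static final class Key {
        final int num;
        final String text;

        Key(int num, String text) {
            this.num = num;
            this.text = text;
        }
    }

    // Compare the first component; look at the second only on a tie.
    static int compare(Key a, Key b) {
        int c = Integer.compare(a.num, b.num);
        return c != 0 ? c : a.text.compareTo(b.text);
    }

    // True when start <= k <= finish under the component-wise ordering.
    static boolean inRange(Key start, Key finish, Key k) {
        return compare(start, k) <= 0 && compare(k, finish) <= 0;
    }

    public static void main(String[] args) {
        Key start = new Key(0, "a");
        Key finish = new Key(1, "\uFFFF");
        System.out.println(inRange(start, finish, new Key(1, "c1"))); // prints true
        System.out.println(inRange(start, finish, new Key(2, "c2"))); // prints false
    }
}
```

{1, "c1"} falls inside the range because its first component ties with the finish key and "c1" sorts before "\uFFFF". {2, "c2"} is excluded on the first component alone.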

It is unfortunate that a JavaDoc typo in the Cassandra source code prevents tools like Eclipse from displaying documentation about CompositeType. But you can always view the source online to get the precise definition and encoding scheme of CompositeType. Reading the source code has been, and still is, the best way to learn new features in Cassandra.
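
For reference, the comments in the CompositeType source describe each component as a 2-byte length, the value bytes, and a single end-of-component byte (0 for an actual column name). The following is a rough sketch of that layout in plain Java, not Hector's or Cassandra's own code. The class and method names are hypothetical, and the integer is assumed to be encoded as a big-endian two's-complement value via BigInteger, which may differ in length from what Hector's IntegerSerializer produces.

```java
import java.io.ByteArrayOutputStream;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

class CompositeEncodingSketch {
    // Lays out each component as <2-byte length><value bytes><end-of-component byte>.
    static byte[] encode(byte[]... components) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] c : components) {
            out.write((c.length >> 8) & 0xFF); // high byte of the length
            out.write(c.length & 0xFF);        // low byte of the length
            out.write(c, 0, c.length);         // the component value
            out.write(0);                      // end-of-component: 0 for a real column name
        }
        return out.toByteArray();
    }

    // Encodes the column key {1, "c1"} from the example above.
    static byte[] example() {
        return encode(BigInteger.valueOf(1).toByteArray(),
                      "c1".getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : example()) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // prints 00 01 01 00 00 02 63 31 00
    }
}
```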

Wednesday, November 16, 2011

GuiceFilter and Static Resources

GuiceServletContextListener and GuiceFilter can eliminate servlet mappings completely from the web.xml file. They also preserve the default servlet handling logic by re-routing a URL request back to the default servlet when none of the Guice-configured servlets matches the requested URL. This technique can be used to serve static resources from a Guice-configured web app. For example, assume this simple web.xml that routes everything through Guice:

<web-app>
  <listener>
    <listener-class>com.myapp.MyGuiceContextListener</listener-class>
  </listener>
  <filter>
    <filter-name>guiceFilter</filter-name>
    <filter-class>com.google.inject.servlet.GuiceFilter</filter-class>
  </filter>
  <filter-mapping>
    <filter-name>guiceFilter</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>
</web-app>

If we have a file named myIndex.html at the same directory level as WEB-INF in the web app layout, we can then request this file with a URL like this:

http://mydomain/myIndex.html

In this case, the GuiceFilter is smart enough to reroute the request to the default servlet for serving the myIndex.html file.
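
The fall-through behavior described above can be pictured with a small model in plain Java. This is only a simplified illustration of the dispatch idea, not Guice's actual implementation: the FallThroughDispatch class, its serve and dispatch methods, and the prefix matching are all hypothetical stand-ins for the real filter chain.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class FallThroughDispatch {
    // Hypothetical registry standing in for Guice-configured servlets: URL prefix -> handler.
    private final Map<String, Function<String, String>> configured = new LinkedHashMap<>();

    void serve(String prefix, Function<String, String> handler) {
        configured.put(prefix, handler);
    }

    String dispatch(String path) {
        // Try each configured servlet in registration order.
        for (Map.Entry<String, Function<String, String>> e : configured.entrySet()) {
            if (path.startsWith(e.getKey())) {
                return e.getValue().apply(path);
            }
        }
        // Nothing matched: hand the request on, as the container's default servlet would.
        return "default servlet: " + path;
    }

    public static void main(String[] args) {
        FallThroughDispatch d = new FallThroughDispatch();
        d.serve("/api/", path -> "configured servlet: " + path);
        System.out.println(d.dispatch("/api/users"));    // prints configured servlet: /api/users
        System.out.println(d.dispatch("/myIndex.html")); // prints default servlet: /myIndex.html
    }
}
```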

Thursday, November 3, 2011

Rules for Configuring ThreadPoolExecutor Pool Size

I came across a blog post, Rules of ThreadPoolExecutor Pool Size. In that post, the author explained reasonably well how a ThreadPoolExecutor (TPE) creates new threads in relation to the thread pool size. But the author incorrectly stated that the TPE would always fill the task queue before creating a new thread. Once the core pool is busy, thread creation is determined by the queuing strategy. Choosing a direct handoff strategy will achieve the author's "user anticipated way", but there is risk in that particular way. Here is a table listing the general pros and cons of each queuing strategy:

Queuing Strategy	Example	Design Trade-off	Worst Case	Usage
Direct Handoff	SynchronousQueue	No task queue, but typically needs an unbounded max pool size, so it can create an unlimited number of threads.	OutOfMemoryError	For tasks that may have interdependencies. For example, task i changes a global state that affects the execution of task j.
Unbounded Queue	LinkedBlockingQueue	Limits the number of threads to the core pool size (the max pool size has no effect) but allows submission of unlimited tasks.	OutOfMemoryError	For tasks that are completely independent of each other.
Bounded Queue	ArrayBlockingQueue	Limits both the number of threads and the queue size.	Low throughput from imbalanced pool and queue sizes.	For burning midnight oil.
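
To see the three strategies side by side, here is a small sketch that submits four blocking tasks to a pool with a core size of 1 and a max size of 4, then reports how many threads the pool created. The class name, the helper method, and the pool sizes are arbitrary choices for illustration. The tasks wait on a latch so that no thread becomes free while tasks are being submitted.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class QueueStrategyDemo {
    // Submits `tasks` blocking tasks and returns the pool size afterwards.
    static int threadsCreated(BlockingQueue<Runnable> queue, int core, int max, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor tpe = new ThreadPoolExecutor(core, max, 1, TimeUnit.SECONDS, queue);
        try {
            for (int i = 0; i < tasks; i++) {
                tpe.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until we have measured
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return tpe.getPoolSize();
        } finally {
            release.countDown(); // let every task finish
            tpe.shutdown();
        }
    }

    public static void main(String[] args) {
        // Direct handoff: nothing can be queued, so each task gets its own thread.
        System.out.println(threadsCreated(new SynchronousQueue<>(), 1, 4, 4));    // prints 4
        // Unbounded queue: the pool never grows beyond the core size.
        System.out.println(threadsCreated(new LinkedBlockingQueue<>(), 1, 4, 4)); // prints 1
        // Bounded queue of 2: a second thread appears only once the queue is full.
        System.out.println(threadsCreated(new ArrayBlockingQueue<>(2), 1, 4, 4)); // prints 2
    }
}
```

Note that with the direct handoff, submitting a fifth task here would be rejected because the pool is already at its max size, which is why that strategy is usually paired with an unbounded max pool size.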
