Randomized Sort: Cassandra


Sunday, December 11, 2011

Fixing libssl and libcrypto Errors in Datastax OpsCenter Startup

Update 2011-12-19: For the 64-bit Amazon Linux AMI, install openssl 0.9.8 with the command "sudo yum install openssl098e-0.9.8e-17.7.amzn1.x86_64". Thanks to thobbs from the DataStax forum for this tip.

The AWS Linux AMI I use has openssl 1.0.0 but DataStax OpsCenter 1.3.1 requires version 0.9.8 of libssl and libcrypto. Why didn't they say so in the docs?? The worst customer experience you can give to your user base is to let your software blow up at startup like this:

Failed to load application: libcrypto.so.0.9.8: cannot open shared object file: No such file or directory

This problem was apparently reported a month ago:
http://www.datastax.com/support-forums/topic/issue-starting-opscenterd-service

But no action has been taken to correct it...sigh...

Here is how we can fix it temporarily on our own before the Cassandra devs get their act together:

1) Install openssl 0.9.8
sudo yum install openssl098e-0.9.8e-17.7.amzn1.i686

2) Change to /usr/lib and manually create the following two symbolic links:
sudo ln -s libssl.so.0.9.8e libssl.so.0.9.8
sudo ln -s libcrypto.so.0.9.8e libcrypto.so.0.9.8

Now OpsCenter will start without the dreaded ssl error.

Monday, November 21, 2011

Cassandra Range Query Using CompositeType

CompositeType is a powerful technique for creating indices using regular column families instead of super column families. But there is a dearth of information on how to use CompositeType in Cassandra. Introduced in 0.8.1 in May 2011, it is a relative newcomer to Cassandra. It doesn't help that it is not even in the "official" datatype documentation for Cassandra 1.0 and 0.8! This article pieces together various tidbits to bring you a complete how-to guide on programming with CompositeType. The code examples use Hector.

Let's say we want to define a column family as the following:
row key: string
column key: composite of an integer and a string
column value: string

We can define the following schema on the cli:

create column family MyCF
    with comparator = 'CompositeType(IntegerType,UTF8Type)'
    and key_validation_class = 'UTF8Type'
    and default_validation_class = 'UTF8Type';

We can also define the same schema programmatically in Hector:

// Step 1: Create a cluster
CassandraHostConfigurator chc 
      = new CassandraHostConfigurator("localhost");
Cluster cluster = HFactory.getOrCreateCluster(
                        "Test Cluster", chc);

// Step 2: Create the schema
ColumnFamilyDefinition myCfd 
      = HFactory.createColumnFamilyDefinition(
            "MyKS", "MyCF", ComparatorType.COMPOSITETYPE);
// Thanks to Shane Perry for this tip:
// http://groups.google.com/group/hector-users/
//       browse_thread/thread/ffd0895a17c7b43e
myCfd.setComparatorTypeAlias("(IntegerType, UTF8Type)");
myCfd.setKeyValidationClass(UTF8Type.class.getName());
myCfd.setDefaultValidationClass(UTF8Type.class.getName());
KeyspaceDefinition myKs = HFactory.createKeyspaceDefinition(
      "MyKS", ThriftKsDef.DEF_STRATEGY_CLASS, 1, 
      Arrays.asList(myCfd));

// Step 3: Add schema to the cluster
cluster.addKeyspace(myKs, true);
Keyspace ks = HFactory.createKeyspace("MyKS", cluster);

Now let's insert a single row with 2 columns:

String rowKey = "row1";

// First column key
Composite colKey1 = new Composite();
colKey1.addComponent(1, IntegerSerializer.get());
colKey1.addComponent("c1", StringSerializer.get());

// Second column key
Composite colKey2 = new Composite();
colKey2.addComponent(2, IntegerSerializer.get());
colKey2.addComponent("c2", StringSerializer.get());

// Insert both columns into row1 at once
// The row key is a String, so the mutator needs a StringSerializer
Mutator<String> m 
      = HFactory.createMutator(ks, StringSerializer.get());
m.addInsertion(rowKey, "MyCF", 
      HFactory.createColumn(colKey1, "foo", 
                            new CompositeSerializer(), 
                            StringSerializer.get()));
m.addInsertion(rowKey, "MyCF", 
      HFactory.createColumn(colKey2, "bar", 
                            new CompositeSerializer(), 
                            StringSerializer.get()));
m.execute();

After the insertion, the column family should look like this table:

row key	{1, "c1"}	{2, "c2"}
row1	foo	bar

Now let's retrieve the first column using a slice query on only the first integer component of the composite column key. Since Cassandra orders composite keys component by component, we can construct a search range from {0, "a"} to {1, "\uFFFF"}, which will include {1, "c1"} but not {2, "c2"}.

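SliceQuery<String, Composite, String> sq 
      = HFactory.createSliceQuery(ks, StringSerializer.get(), 
                                  new CompositeSerializer(), 
                                  StringSerializer.get());
sq.setColumnFamily("MyCF");
sq.setKey("row1");

// Create a composite search range
Composite start = new Composite();
start.addComponent(0, IntegerSerializer.get());
start.addComponent("a", StringSerializer.get());
Composite finish = new Composite();
finish.addComponent(1, IntegerSerializer.get());
finish.addComponent(Character.toString(Character.MAX_VALUE), 
                    StringSerializer.get());
sq.setRange(start, finish, false, 100);

// Now search.
QueryResult<ColumnSlice<Composite, String>> result = sq.execute();

// Walk the returned columns; only {1, "c1"} should fall in the range
for (HColumn<Composite, String> col : result.get().getColumns()) {
    Integer first = col.getName().get(0, IntegerSerializer.get());
    String second = col.getName().get(1, StringSerializer.get());
    System.out.println(first + ", " + second + " = " + col.getValue());
}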
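
The range logic above can be checked without a cluster. The following is only a minimal sketch in plain Java, not Cassandra's actual comparator: the Key class and the compare and inRange helpers are hypothetical names made up for illustration, and the string component is compared with String.compareTo rather than by UTF-8 bytes. It models the component-by-component ordering and shows which of the two column keys falls inside the inclusive range from {0, "a"} to {1, "\uFFFF"}.

```java
class CompositeRangeSketch {

    /** Hypothetical stand-in for a two-component composite column key. */
    static final class Key {
        final int first;
        final String second;

        Key(int first, String second) {
            this.first = first;
            this.second = second;
        }
    }

    /** Compare component by component: first the integer, then the string. */
    static int compare(Key a, Key b) {
        int c = Integer.compare(a.first, b.first);
        return (c != 0) ? c : a.second.compareTo(b.second);
    }

    /** True when start <= k <= finish under the ordering above. */
    static boolean inRange(Key k, Key start, Key finish) {
        return compare(start, k) <= 0 && compare(k, finish) <= 0;
    }

    public static void main(String[] args) {
        Key start = new Key(0, "a");
        Key finish = new Key(1, Character.toString(Character.MAX_VALUE));

        System.out.println(inRange(new Key(1, "c1"), start, finish)); // true
        System.out.println(inRange(new Key(2, "c2"), start, finish)); // false
    }
}
```

The second key is rejected on its first component alone (2 > 1), which is why the range only needs a very large string in the second component of the finish key.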

It is unfortunate that a JavaDoc typo in the Cassandra source code prevents tools like Eclipse from displaying documentation about CompositeType. But you can always view the source online to get the precise definition and encoding scheme of CompositeType. Reading source code has been, and still is, the best way of learning new features in Cassandra.
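
For reference, the encoding described in the CompositeType source is, per component, a 2-byte length, the value bytes, and a single end-of-component byte that is 0 for an actual column name. The sketch below is an illustration only: the class and method names are made up, and it assumes the integer component is written as a 4-byte big-endian int (which is my understanding of what Hector's IntegerSerializer produces) and the string component as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class CompositeEncodingSketch {

    /** Append one component: 2-byte length, value bytes, end-of-component byte 0. */
    static void appendComponent(ByteArrayOutputStream out, byte[] value) {
        out.write((value.length >> 8) & 0xFF);
        out.write(value.length & 0xFF);
        out.write(value, 0, value.length);
        out.write(0);
    }

    /** Encode a composite of an int and a string (assumed 4-byte int, UTF-8 string). */
    static byte[] encode(int first, String second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendComponent(out, ByteBuffer.allocate(4).putInt(first).array());
        appendComponent(out, second.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // {1, "c1"} -> 0004 00000001 00 | 0002 6331 00
        System.out.println(toHex(encode(1, "c1"))); // 000400000001000002633100
    }
}
```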

Wednesday, October 26, 2011

The “initial_token” in Cassandra Means the “Very First Time”

Cassandra uses tokens to split key ranges across nodes. When a Cassandra node is started for the very first time, it checks whether an “initial_token” is specified in cassandra.yaml; if not, the node generates a token based on the cluster it is joining. But how does a node know that it is being started the “very first time”? It is simple. The token is stored on the local disk and persists across process starts and stops. Therefore, once a token is stored, changing the “initial_token” parameter in cassandra.yaml has no effect. When multiple nodes have the same token, Cassandra elects a new owner of the token, prints a warning, and then continues on. The nodetool, however, under-reports the number of nodes in the ring because it only probes nodes that have unique tokens. This is such a common problem when making Cassandra VM images that it even gets its own FAQ entry on the Cassandra wiki. The only safe way to create a new token cleanly is to wipe out the data and commit logs and then restart the node.
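
The startup rule described above can be modeled in a few lines. This is a toy sketch of the behavior only, not Cassandra code: the class, its methods, and the "auto-picked" placeholder token are all hypothetical, and the saved token lives in a field rather than on disk.

```java
class InitialTokenSketch {

    // Hypothetical stand-in for the token persisted on local disk; null until the first start.
    private String savedToken;

    /** Returns the token the node ends up using for this start. */
    String start(String initialTokenFromYaml) {
        if (savedToken == null) {
            // Very first start: honor initial_token if set, otherwise pick one from the ring.
            savedToken = (initialTokenFromYaml != null) ? initialTokenFromYaml : "auto-picked";
        }
        // Any later start ignores initial_token and reuses the stored token.
        return savedToken;
    }

    /** Models wiping the data and commit logs, which forgets the stored token. */
    void wipeDataAndCommitLogs() {
        savedToken = null;
    }

    public static void main(String[] args) {
        InitialTokenSketch node = new InitialTokenSketch();
        System.out.println(node.start("100")); // 100
        System.out.println(node.start("200")); // 100 (the yaml change has no effect)
        node.wipeDataAndCommitLogs();
        System.out.println(node.start("200")); // 200
    }
}
```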

Monday, October 17, 2011

Counting All Rows in Cassandra

Update Oct. 25, 2011: Fixed missing key type in the code fragment.

The SQL language makes counting rows deceptively simple:

SELECT count(*) from MYTABLE;

The count function in the select clause iterates through all rows retrieved from MYTABLE to arrive at a total count. But it is an anti-pattern to iterate through all rows in a column family in Cassandra because Cassandra is a distributed datastore. By the very nature of big data, the full set of rows in a column family may not even fit in memory on a single 32-bit machine! But sometimes, when you load a large static lookup table into a column family, you may want to verify that all rows are indeed stored in the cluster. However, before you start writing code to count rows, you should remember that:

Counting by retrieving all rows is slow.
The first scan may not return the total count due to delay in replication.

Now that we know why we shouldn't iterate through all rows in Cassandra in the first place, we can proceed to write a little function to do exactly that for those rare occasions. Below is an example using Hector and the iterative method. The keyspace in this example uses the Random Partitioner. The example function uses the Range Slice Query technique to iterate through all rows in the order of the MD5 hash values of the keys. Note that Cassandra uses the MD5 hash internally for the Random Partitioner.

   public int totalRowCount() {
      String start = null;
      String lastEnd = null;
      int count = 0;
      while (true) {
         RangeSlicesQuery<String, String, String> rsq = 
            HFactory.createRangeSlicesQuery(ksp, StringSerializer.get(),
                  StringSerializer.get(), StringSerializer.get());
         rsq.setColumnFamily("MY_CF");
         rsq.setColumnNames("MY_CNAME");
         // Nulls are the same as get_range_slices with empty strs.
         rsq.setKeys(start, null); 
         rsq.setReturnKeysOnly(); // Return row keys only, no column values
         rsq.setRowCount(1000); // Arbitrary default
         OrderedRows<String, String, String> rows = rsq.execute().get();
         int rowCount = rows.getCount();
         if (rowCount == 0) {
            break;
         } else {
            start = rows.peekLast().getKey();
            if (lastEnd != null && start.compareTo(lastEnd) == 0) {
               break;
            }
            count += rowCount - 1; // Key range is inclusive
            lastEnd = start;
         }
      }
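      // lastEnd is set once any row has been seen, so a single-row
      // column family is counted as 1 rather than 0.
      if (lastEnd != null) {
         count += 1;
      }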
      return count;
   }
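
The paging arithmetic is easy to get wrong by one, so it can help to exercise it without a cluster. The following is a minimal sketch under stated assumptions: a TreeSet stands in for the column family, fetchPage is a hypothetical replacement for the range slice query, and keys are ordered by natural string order rather than by MD5 token. It also uses a slightly different stop condition than the function above: it stops on the first short page instead of waiting for a repeated last key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class PagedRowCountSketch {

    /** Hypothetical range slice: up to pageSize keys, starting AT start (inclusive). */
    static List<String> fetchPage(NavigableSet<String> store, String start, int pageSize) {
        NavigableSet<String> tail = (start == null) ? store : store.tailSet(start, true);
        List<String> page = new ArrayList<>();
        for (String key : tail) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    static int countRows(NavigableSet<String> store, int pageSize) {
        if (pageSize < 2) {
            // With an inclusive start key, a page of 1 would never advance.
            throw new IllegalArgumentException("pageSize must be at least 2");
        }
        int count = 0;
        String start = null;
        while (true) {
            List<String> page = fetchPage(store, start, pageSize);
            // Every page after the first repeats the previous page's last key.
            count += (start == null) ? page.size() : page.size() - 1;
            if (page.size() < pageSize) {
                break; // short page: the end of the key range was reached
            }
            start = page.get(page.size() - 1);
        }
        return count;
    }

    public static void main(String[] args) {
        NavigableSet<String> store = new TreeSet<>();
        for (int i = 0; i < 2500; i++) {
            store.add(String.format("key%05d", i));
        }
        System.out.println(countRows(store, 1000)); // 2500
    }
}
```

Counting the first page in full and every later page minus its repeated first key avoids the final "add one back" step, so the empty and single-row cases need no special handling.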

Recursion would be a more elegant solution but be aware of the stack limitation in Java.
