Monday, October 17, 2011

Counting All Rows in Cassandra

Update Oct. 25, 2011: Fixed missing key type in the code fragment.

The SQL language makes counting rows deceptively simple:
SELECT count(*) from MYTABLE;
The count function in the select clause iterates through all rows retrieved from mytable to arrive at a total count. In Cassandra, however, iterating through all rows in a column family is an anti-pattern, because Cassandra is a distributed datastore. In a Big Data deployment, the rows of a column family are spread across many nodes, and the full set may not even fit in memory on a single 32-bit machine! Still, when you load a large static lookup table into a column family, you may want to verify that all rows are indeed stored in the cluster. Before you start writing code to count rows, remember that:
  • Counting by retrieving all rows is slow.
  • The first scan may not return the total count due to delay in replication.
Now that we know why we shouldn't iterate through all rows in Cassandra in the first place, we can proceed to write a little function that does exactly that for those rare occasions. Below is an example using Hector and the iterative method. The keyspace in this example uses the Random Partitioner. The function uses the Range Slice Query technique to iterate through all rows in the order of the MD5 hash values of their keys. Note that Cassandra uses MD5 hashes internally for the Random Partitioner.
   public int totalRowCount() {
      String start = null;
      String lastEnd = null;
      int count = 0;
      while (true) {
         RangeSlicesQuery<String, String, String> rsq = 
            HFactory.createRangeSlicesQuery(ksp, StringSerializer.get(),
                  StringSerializer.get(), StringSerializer.get());
         rsq.setColumnFamily("MY_CF");
         rsq.setColumnNames("MY_CNAME");
         // Nulls are the same as get_range_slices with empty strings.
         rsq.setKeys(start, null);
         rsq.setReturnKeysOnly(); // Return row keys only, with no columns
         rsq.setRowCount(1000); // Arbitrary page size
         OrderedRows<String, String, String> rows = rsq.execute().get();
         int rowCount = rows.getCount();
         if (rowCount == 0) {
            break;
         } else {
            start = rows.peekLast().getKey();
            if (lastEnd != null && start.compareTo(lastEnd) == 0) {
               break;
            }
            count += rowCount - 1; // Key range is inclusive
            lastEnd = start;
         }
      }
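      // lastEnd is non-null once any page has returned rows. Testing it
      // (rather than count > 0) also handles a column family with one row.
      if (lastEnd != null) {
         count += 1; // Add back the key left out by the inclusive ranges
      }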
      return count;
   }
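The paging logic above can be tried without a cluster. The following is a minimal sketch against an in-memory stand-in, not Hector or Cassandra code: the class `PagedCountSketch`, its `page` method, and the MD5-derived `token` helper are all hypothetical names introduced for illustration, and the token is used only to get a hash-based (non-lexical) key order. It assumes a page size of at least 2, uses a `long` counter, and stops on a short page rather than by comparing the last keys of consecutive pages as the function above does.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a column family (illustration only).
class PagedCountSketch {
    // token -> row key, kept sorted by token rather than by key text
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    // Assumption: an MD5-derived token, used here only to order the keys by hash.
    static BigInteger token(String key) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(md5).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void put(String key) {
        ring.put(token(key), key);
    }

    // Analogue of one range slice call: up to pageSize keys in token order,
    // starting at startKey inclusive (null means "from the beginning").
    List<String> page(String startKey, int pageSize) {
        Map<BigInteger, String> tail =
                (startKey == null) ? ring : ring.tailMap(token(startKey), true);
        List<String> keys = new ArrayList<>();
        for (String k : tail.values()) {
            if (keys.size() == pageSize) {
                break;
            }
            keys.add(k);
        }
        return keys;
    }

    long countAll(int pageSize) {
        if (pageSize < 2) {
            throw new IllegalArgumentException("pageSize must be at least 2");
        }
        long count = 0;
        String start = null;
        while (true) {
            List<String> keys = page(start, pageSize);
            // Every page after the first repeats the previous page's last key.
            count += (start == null) ? keys.size() : keys.size() - 1;
            if (keys.size() < pageSize) {
                break; // A short page means the end of the range was reached
            }
            start = keys.get(keys.size() - 1);
        }
        return count;
    }
}
```

Stopping on a short page costs one extra round trip when the row count lines up exactly with the page boundaries, but it avoids the special case of adding one at the end.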
Recursion would be a more elegant solution, but be aware of the limited stack depth in Java.
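To make the trade-off concrete, here is a rough sketch of what a recursive version might look like, again against an in-memory stand-in rather than Hector. The class `RecursiveCountSketch` and its `page` helper are hypothetical, and a sorted set of strings stands in for the token-ordered rows. Each page of results adds one stack frame, so the recursion depth grows with the number of pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Hypothetical recursive variant of the paging count (illustration only).
class RecursiveCountSketch {
    // Stand-in for one range slice call: up to pageSize keys, start inclusive.
    static List<String> page(NavigableSet<String> keys, String start, int pageSize) {
        NavigableSet<String> tail = (start == null) ? keys : keys.tailSet(start, true);
        List<String> out = new ArrayList<>();
        for (String k : tail) {
            if (out.size() == pageSize) {
                break;
            }
            out.add(k);
        }
        return out;
    }

    // Counts the rows from start onward; a null start means the whole set.
    static long count(NavigableSet<String> keys, String start, int pageSize) {
        if (pageSize < 2) {
            throw new IllegalArgumentException("pageSize must be at least 2");
        }
        List<String> p = page(keys, start, pageSize);
        // Pages after the first begin with a key that was already counted.
        long fresh = (start == null) ? p.size() : p.size() - 1;
        if (p.size() < pageSize) {
            return fresh; // Short page: nothing left to fetch
        }
        // One additional stack frame per page of results.
        return fresh + count(keys, p.get(p.size() - 1), pageSize);
    }
}
```

A call such as `RecursiveCountSketch.count(keys, null, 1000)` reads more directly than the loop, but the iterative form keeps stack usage constant regardless of how many pages there are.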
