The SQL language makes counting rows deceptively simple:
    SELECT count(*) FROM mytable;

The count function in the SELECT clause iterates through all rows retrieved from mytable to arrive at a total. In Cassandra, however, iterating through every row of a column family is an anti-pattern, because Cassandra is a distributed datastore: the rows are spread across the nodes of the cluster, and at Big Data scale the full set of rows may not even fit in memory on a single 32-bit machine. Still, when you load a large static lookup table into a column family, you may want to verify that all rows are indeed stored in the cluster. Before you start writing code to count rows, remember that:
- Counting by retrieving all rows is slow.
- The first scan may not return the total count due to delay in replication.
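The second caveat means a single scan is best treated as provisional. As a minimal sketch, assuming you know how many rows you loaded, a check can simply re-run the count until it reaches the expected total or gives up. The class `CountVerifier`, the method `countUntilExpected`, and its parameters are hypothetical names chosen for illustration, and `counter` stands in for whatever counting routine you use.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not part of any Cassandra client library.
final class CountVerifier {

    /**
     * Calls counter up to maxAttempts times, pausing pauseMillis between calls,
     * and stops early once the count reaches expected.
     * Returns the last count observed, which may still be below expected.
     */
    static long countUntilExpected(LongSupplier counter, long expected,
                                   int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long last = counter.getAsLong();
        for (int attempt = 1; attempt < maxAttempts && last < expected; attempt++) {
            try {
                Thread.sleep(pauseMillis); // Give replication time to catch up
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the interrupt and stop retrying
                return last;
            }
            last = counter.getAsLong();
        }
        return last;
    }
}
```

Because every attempt is a full scan, the first caveat still applies: keep the number of attempts small and the pause generous.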
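With these caveats in mind, the following method uses the Hector client to page through the column family 1000 keys at a time:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.beans.OrderedRows;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.RangeSlicesQuery;

    // ksp is the Keyspace this class was set up with.
    public int totalRowCount() {
        String start = null; // null means "from the first key"
        int count = 0;
        while (true) {
            RangeSlicesQuery<String, String, String> rsq = HFactory.createRangeSlicesQuery(
                    ksp, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
            rsq.setColumnFamily("MY_CF");
            rsq.setColumnNames("MY_CNAME");
            // Nulls are the same as get_range_slices with empty strings.
            rsq.setKeys(start, null);
            rsq.setReturnKeysOnly(); // Fetch row keys only, not column values
            rsq.setRowCount(1000);   // Arbitrary page size
            OrderedRows<String, String, String> rows = rsq.execute().get();
            int rowCount = rows.getCount();
            // The key range is inclusive, so every page after the first
            // begins with the last key of the previous page.
            int newRows = (start == null) ? rowCount : rowCount - 1;
            if (newRows <= 0) {
                break; // Empty column family, or nothing beyond the last key seen
            }
            count += newRows;
            start = rows.peekLast().getKey();
        }
        return count;
    }

Recursion would be a more elegant solution, but be aware of the stack limitation in Java: each page would add a stack frame, and the JVM does not eliminate tail calls.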
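The inclusive-range paging used above can be exercised without a cluster. The following is a minimal sketch, not Hector code: `PagedCountSketch`, `fetchPage`, and `countKeys` are hypothetical names, and a sorted in-memory set stands in for the column family, which assumes keys come back in a stable order. It shows why every page after the first contributes one row fewer than it returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// In-memory stand-in for a key-ordered range query; illustration only.
final class PagedCountSketch {

    /** Returns up to pageSize keys >= start (inclusive), or from the first key if start is null. */
    static List<String> fetchPage(NavigableSet<String> keys, String start, int pageSize) {
        NavigableSet<String> tail = (start == null) ? keys : keys.tailSet(start, true);
        List<String> page = new ArrayList<>();
        for (String key : tail) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    /** Counts all keys by paging, skipping the key that overlaps between consecutive pages. */
    static long countKeys(NavigableSet<String> keys, int pageSize) {
        if (pageSize < 2) {
            // With a page size of 1 every later page would hold only the overlapping key.
            throw new IllegalArgumentException("pageSize must be at least 2");
        }
        long count = 0;
        String start = null;
        while (true) {
            List<String> page = fetchPage(keys, start, pageSize);
            // After the first page, the first key repeats the previous page's last key.
            int fresh = (start == null) ? page.size() : page.size() - 1;
            if (fresh <= 0) {
                break;
            }
            count += fresh;
            start = page.get(page.size() - 1);
        }
        return count;
    }
}
```

Calling `PagedCountSketch.countKeys(keys, 2)` on a `TreeSet` of five keys walks five pages and counts each key once. The sketch accumulates into a `long` so that very large totals do not overflow.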