The SQL language makes counting rows deceptively simple:
    SELECT count(*) FROM MYTABLE;
The count function in the select clause iterates through all rows retrieved from MYTABLE to arrive at a total count. In Cassandra, however, iterating through all rows in a column family is an anti-pattern because Cassandra is a distributed datastore. At Big Data scale, the rows of a column family may not even fit in memory on a single 32-bit machine! Still, when you load a large static lookup table into a column family, you may want to verify that all rows are indeed stored in the cluster. Before you start writing code to count rows, remember that:
- Counting by retrieving all rows is slow.
- The first scan may not return the total count due to delay in replication.
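The following method, written against the Hector client, pages through the column family 1,000 keys at a time:
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.beans.OrderedRows;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.RangeSlicesQuery;

    // ksp is assumed to be an already-initialized Hector Keyspace.
    public int totalRowCount() {
        String start = null;   // First key of the next page; null means "from the beginning"
        String lastEnd = null; // Last key of the previous page
        int count = 0;
        while (true) {
            RangeSlicesQuery<String, String, String> rsq =
                HFactory.createRangeSlicesQuery(ksp, StringSerializer.get(),
                    StringSerializer.get(), StringSerializer.get());
            rsq.setColumnFamily("MY_CF");
            rsq.setColumnNames("MY_CNAME");
            // Nulls are the same as get_range_slices with empty strs.
            rsq.setKeys(start, null);
            rsq.setReturnKeysOnly(); // Return row keys only, without column values
            rsq.setRowCount(1000); // Arbitrary page size
            OrderedRows<String, String, String> rows = rsq.execute().get();
            int rowCount = rows.getCount();
            if (rowCount == 0) {
                break;
            }
            start = rows.peekLast().getKey();
            if (start.equals(lastEnd)) {
                break; // Only the previous page's last key came back, so we are done
            }
            // Key range is inclusive: the last key of this page is also the
            // first key of the next page, so leave it out of this page's tally.
            count += rowCount - 1;
            lastEnd = start;
        }
        // The last key seen was never counted. Testing lastEnd instead of
        // count > 0 also handles a column family that holds exactly one row.
        if (lastEnd != null) {
            count += 1;
        }
        return count;
    }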
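The inclusive-range bookkeeping is easy to get wrong, so it can help to exercise it without a cluster. The sketch below is only an illustration: the class PagedCountSketch and its fetchPage helper are hypothetical stand-ins, not part of Hector, and a sorted in-memory set plays the role of the column family. It assumes that each page starts at the previous page's last key and that the page size is at least 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical stand-in for a column family: a sorted set of row keys.
class PagedCountSketch {

    // Mimics an inclusive range query: returns up to pageSize keys >= start,
    // or the first pageSize keys when start is null.
    static List<String> fetchPage(NavigableSet<String> keys, String start, int pageSize) {
        NavigableSet<String> tail = (start == null) ? keys : keys.tailSet(start, true);
        List<String> page = new ArrayList<>();
        for (String key : tail) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    // Same loop shape as the paging method above; pageSize must be >= 2,
    // otherwise every page after the first only repeats the previous last key.
    static int countKeys(NavigableSet<String> keys, int pageSize) {
        String start = null;
        String lastEnd = null;
        int count = 0;
        while (true) {
            List<String> page = fetchPage(keys, start, pageSize);
            if (page.isEmpty()) {
                break;
            }
            start = page.get(page.size() - 1);
            if (start.equals(lastEnd)) {
                break; // Only the previous last key came back
            }
            count += page.size() - 1; // Last key is tallied with the next page
            lastEnd = start;
        }
        return (lastEnd != null) ? count + 1 : count;
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>();
        for (int i = 0; i < 2500; i++) {
            keys.add(String.format("key%05d", i));
        }
        System.out.println(countKeys(keys, 1000));
    }
}
```

Running it against an empty set, a single key, and a key count that is an exact multiple of the page size covers the boundary cases. Note that a real cluster returns rows in partitioner order, not necessarily in the lexical order used here.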
Recursion would be a more elegant solution, but be aware of Java's limited stack depth.
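As a rough illustration of that trade-off, here is what a recursive variant might look like. This is a sketch under assumptions, not Hector code: RecursiveCountSketch and fetchPage are hypothetical names, and a sorted in-memory set again stands in for the column family. Each page adds one stack frame, and Java does not eliminate these calls, so a very large column family read with a small page size could end in a StackOverflowError.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class RecursiveCountSketch {

    // Hypothetical stand-in for an inclusive range query over sorted row keys.
    static List<String> fetchPage(NavigableSet<String> keys, String start, int pageSize) {
        NavigableSet<String> tail = (start == null) ? keys : keys.tailSet(start, true);
        List<String> page = new ArrayList<>();
        for (String key : tail) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    // Counts the keys that follow start. A non-null start is the last key of
    // the previous page and has already been counted by the caller.
    // pageSize must be >= 2 so that every page can contribute a new key.
    static int countFrom(NavigableSet<String> keys, String start, int pageSize) {
        List<String> page = fetchPage(keys, start, pageSize);
        // On every page but the first, the first key repeats the previous last key.
        int fresh = (start == null) ? page.size() : page.size() - 1;
        if (fresh <= 0) {
            return 0; // Nothing new came back: end of the key range
        }
        String last = page.get(page.size() - 1);
        return fresh + countFrom(keys, last, pageSize); // One stack frame per page
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>();
        for (int i = 0; i < 2500; i++) {
            keys.add(String.format("key%05d", i));
        }
        System.out.println(countFrom(keys, null, 1000));
    }
}
```

The recursion depth is roughly the row count divided by the page size, so a larger page size keeps the stack shallow at the cost of bigger responses per query.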