Good to know

Database size

A single database record uses 200+ bytes if the request paths are short on average (~25 characters); the record size grows with longer paths.
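
As a rough back-of-the-envelope check, the record size translates into database size as in the sketch below (the function name is made up, and 200 bytes/record is an illustrative average, not a guarantee):

# Rough size estimate - a sketch only; 200 bytes/record is an average
# observed with short (~25 character) request paths and grows with
# longer paths.
def estimate_db_size_gb(num_records, bytes_per_record=200):
    """Return a rough database size estimate in GB."""
    return num_records * bytes_per_record / 1024**3

# The 384.1 million records mentioned below come out at roughly 72 GB,
# in the same ballpark as the 74 GB actually observed.
print(f"{estimate_db_size_gb(384_100_000):.1f} GB")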

Compute

As of today, loading the database is single-threaded. Depending on the disk throughput, it will use a single CPU at up to 100%.

Running queries, on the other hand, is done in parallel using subprocesses. Each of them will load a single CPU up to 100%, again depending on disk throughput.

In the default setting (i.e. without specifying --procs), it will spawn as many subprocesses as there are CPUs in the system. This can easily push your system to its limits.
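
The dispatch pattern is roughly the one sketched below - this is not the actual hcprequestanalytics code; run_query() and the queries dict are made up for illustration:

# Sketch of the parallel query dispatch described above - NOT the actual
# hcprequestanalytics code; run_query() and the queries dict are made up.
import multiprocessing as mp
import sqlite3

def run_query(args):
    """Open the database in a subprocess and run one named query."""
    dbfile, name, sql = args
    with sqlite3.connect(dbfile) as con:
        return name, con.execute(sql).fetchall()

def run_all(dbfile, queries, procs=None):
    """procs=None mimics the default: one subprocess per CPU."""
    with mp.Pool(processes=procs or mp.cpu_count()) as pool:
        jobs = [(dbfile, name, sql) for name, sql in queries.items()]
        return dict(pool.map(run_query, jobs))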

Disk

Depending on the amount of log data loaded, the database itself can get quite big. A busy 12-node HCP generated a 7.3 GB (compressed) log package for a single week; that translated into a 74 GB database holding 384.1 million log records.

Because no indexes are configured for the database (many different ones would be needed to cover all queries), the required indexes are created (and loaded) on the fly when running queries. They end up in your system's usual tmp folder; if that folder doesn't have enough free capacity, the queries will fail. Some of the more complex queries require as much temporary disk space as the database itself.
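
Whether a particular query causes SQLite to build such transient structures can be checked with EXPLAIN QUERY PLAN; the sketch below uses a made-up database file, table and column names, not the actual schema:

# Sketch: check whether SQLite builds transient structures for a query.
# The database file, table and column names are made up for illustration.
import sqlite3

con = sqlite3.connect("requests.db")
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT httpcode, COUNT(*) FROM requests "
    "GROUP BY httpcode ORDER BY COUNT(*) DESC"
).fetchall()
for row in plan:
    # Plan lines mentioning 'USE TEMP B-TREE' or 'AUTOMATIC COVERING INDEX'
    # point to structures that may be materialized in SQLite's temp folder.
    print(row)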

Now think of running several of these queries in parallel, each creating its own temporary indexes. When analyzing huge databases, this will likely overload your system unless you have plenty of free disk space.

If hcprequestanalytics prints error messages about the filesystem or database being full, you can make sure that a folder with sufficient free space is used for the temporary database indexes by setting this environment variable before running hcprequestanalytics:

$ export SQLITE_TMPDIR=/wherever/you/have/enough/space

Make sure to replace /wherever/you/have/enough/space with a path that matches your system's reality, of course!

Memory

The percentile() aggregate function in particular needs a lot of memory when used in queries against huge databases, because it has to hold a list of all values in memory to be able to calculate the percentile at the end.

The req_httpcode query mentioned earlier has been observed to use more than 35 GB of real memory against the database described above.
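
The principle is easy to see in a minimal sketch of a list-collecting percentile aggregate, registered here via Python's sqlite3 module - an illustration only, not the actual implementation used by hcprequestanalytics:

# Minimal sketch of a list-collecting percentile aggregate - an
# illustration of the principle, not the actual implementation.
import math
import sqlite3

class Percentile95:
    """Naive 95th-percentile aggregate: keeps every value in memory."""
    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)      # one list entry per record!

    def finalize(self):
        if not self.values:
            return None
        self.values.sort()
        # nearest-rank 95th percentile
        return self.values[max(0, math.ceil(len(self.values) * 0.95) - 1)]

con = sqlite3.connect(":memory:")
con.create_aggregate("percentile95", 1, Percentile95)
# With hundreds of millions of records, the values list alone grows to
# many gigabytes - which is what drives the memory usage described above.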

Trying to use more memory than is available will usually kill a query. Running multiple queries in parallel, each of them allocating a huge amount of memory, will quickly bring you to that point, and all of those queries will fail.

Conclusion

A simple task - analyzing HTTP log files - can be much more challenging than expected.

Compute, disk, memory and parallelism all become relevant as soon as the amount of data exceeds a fairly low threshold. Depending on the amount of log data to analyze, these resources need to be balanced.

The only strategies here are:

  • use the percentile() aggregate function sparingly to save memory
  • run fewer queries in parallel than the number of CPUs would allow (--procs 2, for example)
  • or even run queries one at a time (turn off multi-processing with --procs 1)

or:

  • throw more hardware at the problem: CPUs, memory, disk capacity