Motivation
There’s nothing more important than having a good level of observability for the service and it’s not possible without reliable and fast storage for the application’s logs. One of the most popular solutions nowadays is ELK (ElasticSearch-Logstash-Kibana), but it is not as universal as it may seem.
Problem statement
ELK is a flexible, convenient, and complex solution to collect and analyze logs. It is scalable, robust, flexible query filtering, and universal. However, there are disadvantages to using ELK:
- high memory and CPU resource consumption
- Index and search speed degradation along with increasing number of records
- full-text search index overhead
- complicated QueryDSL
All of these make me wonder: is there an alternative to ELK when we talk about logs?
Here is my list of requirements for the log-handling solution:
- fast and flexible aggregation
- fast insertion, fast select
- easy-to-setup and resource-minimal
- grafana-compatible solution
However, it's nice to have a flexible and fast tool to do regular and ad-hoc analysis. Let's be more specific about the system that I’m going to implement:
- Process 20M+ of log lines which are Nginx access log
- Get the number of bad HTTP responses (4xx, 5xx)
- Get bot and crawler counters and the URL they visit
- Get top-5 IP addresses that are suspicious bots
What is ClickHouse and why did I decide to use it
ClickHouse is a high-performance column-oriented SQL database management system for online analytical processing as it’s declared.
The most important features for me were:
- Column-oriented storage means that ClickHouse will read from disk only if needed.
- ClickHouse can leverage all available resources (CPU cores and disks) to execute even a single query.
- Data compression
- SQL Support, last but not least
Technical Solution
Github repo tutorial
You can find the tutorial along with a log generator on GitHub.
Setup ClickHouse and connect
First of all, let’s create docker-compose.yml to define both services. (https://github.com/bp72/nginxlogprocessor/blob/init-commit/docker-compose.yml)
version: '3.6'
services:
ch:
image: clickhouse/clickhouse-server
container_name: clickhouse
restart: always
volumes:
- clickhousedata:/var/lib/clickhouse/
ports:
- '8123:8123'
- '9000:9000'
ulimits:
memlock:
soft: -1
hard: -1
nofile:
soft: 262144
hard: 262144
# This capabilities prevents Docker from complaining about lack of those
cap_add:
- SYS_NICE
- NET_ADMIN
- IPC_LOCK
volumes:
clickhousedata:
Let’s run it and check if everything is working: connect to the ClickHouse instance using ClickHouse client from the docker container and check if available via localhost and port
> docker-compose up -d
[+] Running 3/3
⠿ Network nginxlogprocessor_default Created 0.1s
⠿ Container clickhouse Started 0.6s
To connect to the instance I prefer ClickHouse client
docker-compose exec ch clickhouse-client
ClickHouse client version 23.9.1.1854 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 23.9.1 revision 54466.
a8c8da069d94 :)
Create databases and tables
Now, when everything is set, let’s create databases and tables: database to track processed logs and database for specific service, e.g. nginx, and first table nginx.access .
One of the significant advantages of ClickHouse is SQL syntax for definitions and queries. It’s not strictly SQL-standard, but very close to it.
CREATE DATABASE IF NOT EXISTS nginx
CREATE DATABASE IF NOT EXISTS logs
CREATE TABLE IF NOT EXISTS nginx.access (
reqid String,
ts DateTime64(3),
level Enum(''=0, 'debug'=1, 'info'=2, 'warn'=3 ,'error'=4),
domain String,
uri String,
ua String,
ref String,
is_bot Boolean,
is_mobile Boolean,
is_tablet Boolean,
is_pc Boolean,
client String,
duration Float32,
response_code UInt16,
addrIPv4 Nullable(IPv4),
addrIPv6 Nullable(IPv6),
upstream_connect_time Float32,
upstream_header_time Float32,
upstream_response_time Float32
)
ENGINE MergeTree
PRIMARY KEY reqid
ORDER BY reqid
CREATE TABLE IF NOT EXISTS logs.logfiles (
filename String
)
ENGINE MergeTree
PRIMARY KEY filename
ORDER BY filename
Closer look at CREATE TABLE statement, you can see slightly different types and completely new types like Enum and IPv4. ClickHouse tries to reduce resource usage and to do that he optimizes it with cool features like Enum. It’s basically a key-value mapping of a string to an int of 8bit or 16bit, which automatically converts on insertion it converts string value to int, and on selection converts the opposite way (link).
IPv4, IPv6 are the special types to store addresses in the most optimal way (as unsigned int) and represent those in a human-readable way, so basically at the insert time you provide a string representation of IP-addr and ClickHouse does everything for you: stores it as int and server unpacked on selection.
Log Insertion
ClickHouse ideology is to insert fast. In order to do that, ClickHouse handles batch inserts better than one by one.
So the insertion script is not very complicated. readFile function yields data chunks of 50k records at max for the ClickHouse client to insert. Each chunk item represents the list of values relative to column names in the cols list
# it's not an actual code.
# the working implementation you can find at https://github.com/bp72/nginxlogprocessor
import clickhouse_connect
from config import CLICKHOUSE_HOST, CLICKHOUSE_PORT
from log import log
client = clickhouse_connect.get_client(host=CLICKHOUSE_HOST, port=CLICKHOUSE_PORT)
def loadToClickHouse(client, chunk):
cols = [
'reqid',
'ts',
'level',
'domain',
'uri',
'ua',
'ref',
'is_bot',
'is_mobile',
'is_tablet',
'is_pc',
'client',
'duration',
'response_code',
'addrIPv4',
'addrIPv6',
'upstream_connect_time',
'upstream_header_time',
'upstream_response_time',
]
client.insert('nginx.access', chunk, column_names=cols)
def processFeed(feed, client, chunk_size=10_000):
total = 0
for chunk in readFile(feed, chunk_size=chunk_size):
total += len(chunk)
loadToClickHouse(client, chunk=chunk)
log.info(f'process {feed=} inserted={len(chunk)} {total=}')
Actual execution and timing I have on my PC, you can see that parsing and insertion of the 800k records file took 21s of Python execution time. Not bad!
> .venv/bin/python ./main.py
I:2023-10-15 12:44:02 [18764] f=transport.py:1893 Connected (version 2.0, client OpenSSH_8.9p1)
I:2023-10-15 12:44:02 [18764] f=transport.py:1893 Authentication (publickey) successful!
I:2023-10-15 12:44:02 [18764] f=fetcher.py:14 connect host='*.*.*.*' port=22 user='root' password=None
I:2023-10-15 12:44:02 [18764] f=fetcher.py:18 run cmd='ls /var/log/nginx/*access*.log-*'
I:2023-10-15 12:44:02 [18764] f=fetcher.py:34 download src=/var/log/nginx/access.log-2023100812.gz dst=/tmp/access.log-2023100812.gz
I:2023-10-15 12:44:07 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=50000
I:2023-10-15 12:44:08 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=100000
I:2023-10-15 12:44:10 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=150000
I:2023-10-15 12:44:11 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=200000
I:2023-10-15 12:44:13 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=250000
I:2023-10-15 12:44:14 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=300000
I:2023-10-15 12:44:15 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=350000
I:2023-10-15 12:44:17 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=400000
I:2023-10-15 12:44:18 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=450000
I:2023-10-15 12:44:20 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=500000
I:2023-10-15 12:44:21 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=550000
I:2023-10-15 12:44:23 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=600000
I:2023-10-15 12:44:24 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=650000
I:2023-10-15 12:44:25 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=700000
I:2023-10-15 12:44:27 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=750000
I:2023-10-15 12:44:28 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=800000
I:2023-10-15 12:44:28 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=2190 total=802190
I:2023-10-15 12:44:28 [18764] f=fetcher.py:34 download src=/var/log/nginx/access.log-2023100814.gz dst=/tmp/access.log-2023100814.gz
I:2023-10-15 12:44:31 [18764] f=main.py:20 process feed='/tmp/access.log-2023100814.gz' inserted=50000 total=50000
I:2023-10-15 12:44:32 [18764] f=main.py:20 process feed='/tmp/access.log-2023100814.gz' inserted=50000 total=100000
I:2023-10-15 12:44:33 [18764] f=main.py:20 process feed='/tmp/access.log-2023100814.gz' inserted=30067 total=130067
Log analysis and problem detection
ClickHouse uses SQL to query the DB which is very comfortable for most software engineers and intuitively simple.
Let’s start with checking the number of records we have, it’s 22M.
a8c8da069d94 :) select count(1) from nginx.access;
SELECT count(1)
FROM nginx.access
Query id: f94881f3-2a7d-4039-9646-a6f614adb46c
┌──count()─┐
│ 22863822 │
└──────────┘
It’s easy to have queries with different breakdowns, which might be useful for problem detection and solution, for example, I would like to know from which IP address the host is being scanned for vulnerability.
This query demonstrates how flexible data queries can be compared to the ELK. WITH .. AS statement and IN / NOT IN, subquery, aggregation, and filtration make the ClickHouse very convenient.
a8c8da069d94 :) with baduri as (select uri, count(1) from nginx.access where response_code = 404 and uri not in ('/about/', '/favicon.ico') group by 1 having count(1) > 3 order by 2 desc limit 10) select IPv4NumToStringClassC(addrIPv4), count(1) from nginx.access where uri in (select uri from baduri) and addrIPv4 is not null group by 1 order by 2 desc limit 5
WITH baduri AS
(
SELECT
uri,
count(1)
FROM nginx.access
WHERE (response_code = 404) AND (uri NOT IN ('/about/', '/favicon.ico'))
GROUP BY 1
HAVING count(1) > 3
ORDER BY 2 DESC
LIMIT 10
)
SELECT
IPv4NumToStringClassC(addrIPv4),
count(1)
FROM nginx.access
WHERE (uri IN (
SELECT uri
FROM baduri
)) AND (addrIPv4 IS NOT NULL)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5
Query id: cf9bea33-212b-4c58-b6af-8e0aaae50b83
┌─IPv4NumToStringClassC(addrIPv4)─┬─count()─┐
│ 8.219.64.xxx │ 961 │
│ 178.128.220.xxx │ 378 │
│ 103.231.78.xxx │ 338 │
│ 157.245.200.xxx │ 324 │
│ 116.203.28.xxx │ 260 │
└─────────────────────────────────┴─────────┘
5 rows in set. Elapsed: 0.150 sec. Processed 45.73 million rows, 1.81 GB (303.88 million rows/s., 12.01 GB/s.)
Peak memory usage: 307.49 MiB.
Let’s get the top 5 most popular uri per domain. This query uses the handy LIMIT x BY <field> function.
a8c8da069d94 :) select domain, uri, count(1) from nginx.access where domain in ('example.com', 'nestfromthebest.com', 'az.org') group by 1, 2 order by 1, 3 desc limit 5 by domain
SELECT
domain,
uri,
count(1)
FROM nginx.access
WHERE domain IN ('example.com', 'nestfromthebest.com', 'az.org')
GROUP BY
1,
2
ORDER BY
1 ASC,
3 DESC
LIMIT 5 BY domain
Query id: 2acd328c-ed82-4d36-916b-8f2ecf764a9d
┌─domain──────┬─uri────────────┬─count()─┐
│ az.org │ /about/ │ 382543 │
│ az.org │ /contacts/ │ 42066 │
│ az.org │ /category/id7 │ 2722 │
│ az.org │ /category/id14 │ 2704 │
│ az.org │ /category/id2 │ 2699 │
│ example.com │ /about/ │ 381653 │
│ example.com │ /contacts/ │ 42023 │
│ example.com │ /category/id2 │ 2694 │
│ example.com │ /category/id8 │ 2688 │
│ example.com │ /category/id13 │ 2670 │
└─────────────┴────────────────┴─────────┘
┌─domain──────────────┬─uri────────────┬─count()─┐
│ nestfromthebest.com │ /about/ │ 383377 │
│ nestfromthebest.com │ /contacts/ │ 42100 │
│ nestfromthebest.com │ /category/id8 │ 2726 │
│ nestfromthebest.com │ /category/id14 │ 2700 │
│ nestfromthebest.com │ /category/id4 │ 2696 │
└─────────────────────┴────────────────┴─────────┘
15 rows in set. Elapsed: 0.062 sec. Processed 23.97 million rows, 918.43 MB (388.35 million rows/s., 14.88 GB/s.)
Peak memory usage: 98.67 MiB.
Conclusion
ClickHouse is a great tool to store and manipulate specific data like logs at a large scale. It is definitely worth further learning, and understanding, for example, nested data structure, sampling tooling, window functions, and others
I hope you enjoyed this small article and was useful for you!