This article is reproduced from “Caiyun Weather Geo Query Optimization: The nearest N points”
Let’s start from a real-world scenario:
How to find the K nearest national observation stations to the 768 Creative Industry Park in Haidian District, Beijing?
The simplest approach is to iterate over all candidate stations, compute the distance from each station to the 768 park, and pick the K smallest distances. The code is straightforward, but it is slow.
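A minimal sketch of this brute-force baseline might look like the following. The `Station` type, the `nearestK` helper, and the sample coordinates are all illustrative, not the production code:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Station is a hypothetical record for an observation station.
type Station struct {
	Name     string
	Lat, Lon float64 // degrees
}

// haversine returns the great-circle distance in kilometers.
func haversine(lat1, lon1, lat2, lon2 float64) float64 {
	const R = 6371.0 // mean Earth radius, km
	rad := math.Pi / 180
	dLat := (lat2 - lat1) * rad
	dLon := (lon2 - lon1) * rad
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(lat1*rad)*math.Cos(lat2*rad)*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * R * math.Asin(math.Sqrt(a))
}

// nearestK scans every candidate, sorts by distance, and keeps the K closest.
// This is the slow O(n log n) baseline the article starts from.
func nearestK(stations []Station, lat, lon float64, k int) []Station {
	sorted := make([]Station, len(stations))
	copy(sorted, stations)
	sort.Slice(sorted, func(i, j int) bool {
		return haversine(lat, lon, sorted[i].Lat, sorted[i].Lon) <
			haversine(lat, lon, sorted[j].Lat, sorted[j].Lon)
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

func main() {
	// Coordinates are illustrative only.
	stations := []Station{
		{"A", 39.90, 116.40},
		{"B", 40.10, 116.20},
		{"C", 31.23, 121.47},
	}
	near := nearestK(stations, 39.98, 116.31, 2)
	fmt.Println(near[0].Name, near[1].Name)
}
```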
Why does performance matter? In our real system, the distance cannot be the 2D Euclidean distance $\sqrt{(x_{1} - x_{2})^2 + (y_{1} - y_{2})^2}$, because the Earth is a curved surface: once distances grow large enough, the curvature meaningfully changes distances measured along the surface.
The most accurate distance formula is Vincenty's formulae. Its computation is quite involved and uses many trigonometric functions.
Another alternative is the haversine formula:
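For reference, with latitudes $\varphi_1, \varphi_2$, longitudes $\lambda_1, \lambda_2$, and Earth radius $R$, the haversine formula for the great-circle distance $d$ reads:

$$d = 2R \arcsin\left(\sqrt{\sin^2\frac{\varphi_2-\varphi_1}{2} + \cos\varphi_1\,\cos\varphi_2\,\sin^2\frac{\lambda_2-\lambda_1}{2}}\right)$$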
It is slightly less accurate, but for meteorological data the spatial precision is sufficient. However, this formula still contains several trigonometric operations. Although modern CPUs have optimized instructions for these, trigonometric functions are still relatively slow compared to additions, subtractions, multiplications and divisions. When the number of candidate stations reaches tens of thousands, these computations produce a significant CPU cost.
The first optimization we applied was a coarse latitude/longitude bounding check: only coordinates that fall within a certain bounding box proceed to the haversine distance calculation; others are skipped. After this optimization, a typical station retrieval and sorting took about 200,000 ns, and some slower queries reached 400,000 ns.
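The bounding check can be sketched as below: compare raw degree offsets first, and only pay for the trigonometric haversine call when a point survives the cheap comparison. The `point` type, `candidatesNear` helper, and box size are assumptions for illustration:

```go
package main

import (
	"fmt"
	"math"
)

type point struct{ lat, lon float64 }

// haversine returns the great-circle distance in kilometers.
func haversine(p, q point) float64 {
	const R = 6371.0 // mean Earth radius, km
	rad := math.Pi / 180
	dLat := (q.lat - p.lat) * rad
	dLon := (q.lon - p.lon) * rad
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(p.lat*rad)*math.Cos(q.lat*rad)*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * R * math.Asin(math.Sqrt(a))
}

// candidatesNear keeps only points whose raw lat/lon offsets fall inside a
// degree bounding box, then computes the exact distance for the survivors.
func candidatesNear(pts []point, q point, dDeg float64) []float64 {
	var dists []float64
	for _, p := range pts {
		if math.Abs(p.lat-q.lat) > dDeg || math.Abs(p.lon-q.lon) > dDeg {
			continue // cheap comparisons only, no trig
		}
		dists = append(dists, haversine(p, q))
	}
	return dists
}

func main() {
	q := point{39.98, 116.31}
	pts := []point{{39.90, 116.40}, {31.23, 121.47}}
	// The distant second point is rejected before any trig runs.
	fmt.Println(len(candidatesNear(pts, q, 1.0))) // prints 1
}
```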
For a statically compiled language like Go, these numbers are still high. We wanted to reduce the overhead because at peak times the service performs tens of thousands of such operations per second, so speed directly affects cost. Profiling with pprof showed that the real CPU hotspot was memory allocation, caused by a for-loop iterating a slice tens of thousands of elements long. The core optimization goal became avoiding iteration over the entire dataset: we needed a geographic index.
We surveyed various community RTree/QuadTree implementations. Some still had performance problems in certain scenarios; for example, some queries cost around 100,000 ns, and their KNN implementations did not provide sorted results — you still had to compute and sort distances yourself. So we decided to implement a simple, easy-to-understand, and fast approach ourselves. If you prefer RTree, don’t worry: RTree will appear in later sections.
The idea was inspired by the map tiles used in slippy maps. For each station we compute which tile (x, y) it belongs to at a particular zoom level. By adding -1, 0, or +1 to the x and y indices we also get the eight neighboring tiles:
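The tile calculation follows the standard slippy-map convention; a sketch in Go, with `tileXY` and `neighbors` as hypothetical helper names:

```go
package main

import (
	"fmt"
	"math"
)

// tileXY maps a WGS84 coordinate to slippy-map tile indices at a zoom level.
// This is the only step that needs trigonometric functions.
func tileXY(lat, lon float64, zoom int) (x, y int) {
	n := math.Exp2(float64(zoom)) // number of tiles per axis
	x = int((lon + 180.0) / 360.0 * n)
	latRad := lat * math.Pi / 180
	y = int((1.0 - math.Log(math.Tan(latRad)+1/math.Cos(latRad))/math.Pi) / 2.0 * n)
	return
}

// neighbors returns the tile itself plus its eight neighbors, using only
// additions and subtractions on the indices.
func neighbors(x, y int) [][2]int {
	var out [][2]int
	for dx := -1; dx <= 1; dx++ {
		for dy := -1; dy <= 1; dy++ {
			out = append(out, [2]int{x + dx, y + dy})
		}
	}
	return out
}

func main() {
	x, y := tileXY(39.98, 116.31, 12)
	fmt.Println(x, y, len(neighbors(x, y))) // always 9 tiles to probe
}
```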
Thanks to the tile index structure, apart from the initial tile calculation (which still requires trigonometric operations), subsequent operations are only additions and subtractions.
For convenience we store the tile index for a station in a map keyed by the reversed order z, y, x. This way we avoid brute-forcing all stations during filtering and only need to look up 9 keys.
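A sketch of that index, assuming the slippy-map tile math from earlier; `tileKey`, `buildIndex`, and `query` are illustrative names, not the production API:

```go
package main

import (
	"fmt"
	"math"
)

// tileKey stores the indices in reversed order z, y, x, as described in the text.
type tileKey struct{ z, y, x int }

type station struct {
	name     string
	lat, lon float64
}

// tileXY maps a WGS84 coordinate to slippy-map tile indices at a zoom level.
func tileXY(lat, lon float64, zoom int) (x, y int) {
	n := math.Exp2(float64(zoom))
	x = int((lon + 180.0) / 360.0 * n)
	latRad := lat * math.Pi / 180
	y = int((1.0 - math.Log(math.Tan(latRad)+1/math.Cos(latRad))/math.Pi) / 2.0 * n)
	return
}

// buildIndex buckets every station into its tile at the given zoom.
func buildIndex(stations []station, zoom int) map[tileKey][]station {
	idx := make(map[tileKey][]station, len(stations))
	for _, s := range stations {
		x, y := tileXY(s.lat, s.lon, zoom)
		key := tileKey{zoom, y, x}
		idx[key] = append(idx[key], s)
	}
	return idx
}

// query probes only the 9 keys around the query tile
// instead of scanning every station.
func query(idx map[tileKey][]station, lat, lon float64, zoom int) []station {
	x, y := tileXY(lat, lon, zoom)
	var out []station
	for dx := -1; dx <= 1; dx++ {
		for dy := -1; dy <= 1; dy++ {
			out = append(out, idx[tileKey{zoom, y + dy, x + dx}]...)
		}
	}
	return out
}

func main() {
	idx := buildIndex([]station{
		{"A", 39.98, 116.31}, // same tile as the query point
		{"C", 31.23, 121.47}, // far away, never probed
	}, 12)
	fmt.Println(len(query(idx, 39.98, 116.31, 12))) // prints 1
}
```

The candidates returned by the 9-key probe are a small set, so running the exact haversine distance and a final sort over them stays cheap.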
In our current station dataset, aside from the contiguous United States (which is relatively uniformly distributed), station distributions in other countries and regions are highly uneven: the density in central-eastern China is much higher than in western China; Western Europe has higher station density than Eastern Europe, and so on. So in production we use a two-level pre-indexing mechanism: when no data is found at a high-resolution tile, we fall back to searching at a coarser scale.
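The fallback logic itself is simple; a minimal sketch, where `lookup` abstracts over the 9-tile probe described above and all names and data are hypothetical:

```go
package main

import "fmt"

// lookup is a stand-in for the 9-tile probe at a given zoom level.
type lookup func(zoom int) []string

// withFallback tries zoom levels from fine to coarse and stops at the
// first level that yields any candidates.
func withFallback(find lookup, zooms ...int) []string {
	for _, z := range zooms {
		if hits := find(z); len(hits) > 0 {
			return hits
		}
	}
	return nil
}

func main() {
	// Hypothetical data: nothing indexed at zoom 12, one station at zoom 8.
	byZoom := map[int][]string{8: {"stationX"}}
	find := func(z int) []string { return byZoom[z] }
	fmt.Println(withFallback(find, 12, 8)) // prints [stationX]
}
```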
With this optimization, our station queries take only about 900 ns to 3,000 ns, and service CPU overhead decreased by roughly 10%. In our roadmap, we also have an RTree-based indexing option — for datasets with varying spatial resolution or very sparse data, RTree will be used by default to provide adaptive coverage.