When your goal is to integrate a high performance text based search engine and a large number of location dependant classified ads, one of the questions that arises is : How do you efficiently query and manipulate any number of these ads within an area of any shape and size ?
There are a number of online solutions but we wanted an in-house system, mainly for performance issues, that was as close as possible to the Exalead search layer. With the rest of the engineering team at Yakaz.com, we chose to actually put it inside the search layer. Out of the box, Exalead supports numerical queries with the regular comparison operators (<, >...), so using a classical coordinate system (latitude/longitude, Mercator...) is feasible. Fairly obviously though, this cannot efficiently be used in production and, as speed is a paramount focal point, we had to come up with a different way of doing things.
The most common spatial indexing technique is the R-tree (see R-tree@wikipedia) and derivatives. Although efficient, the "shape" of the tree depends on the data and evolves when data is added or removed. But remember, this has to be fed to a full-text search engine, we cannot rebalance the tree whenever a new point is added. Therefore we need to store and retrieve objects with an a priori knowledge of the tree structure. The next obvious step is to look at the Quadtree structure, and we took a very good look...
The Quadtree Representation of the Earth
The quadtree is a tree data structure in which every node has exactly 4 children (see Quadtree@wikipedia for details). It can be used to divide a rectangular surface into a predefined set cells of known sizes at multiple levels. To go from one level of the tree to the next, each cell is divided into 4 cells of same size that completely cover the area of the parent cell. By using the complete tree, we can associate any area or point of the original rectangle to a specific set of cells at any given level. On the right is a visual example of such a structure (taken from the wikipedia article on Quadtrees).
This is the idea behind the scheme used for online mapping systems (openstreetmap, google maps, yahoo maps...) where the Earth is represented as a square (in an adapted version of the Mercator projection). At the lowest zoom, the root of the quadtree, there is typically a single cell (or tile) that contains the whole planet. Tiles are images that are typically 256 by 256 pixels, see below for an example taken from openstreetmap
When you zoom in, the tiles remain images of the same size and the Earth is divided among all the tiles at the same zoom level. For example, at level 1, the first level down into the quadtree, we have the 4 images shown below (taken from openstreetmap), which put together form a 2 fold zoom of the previous image at zoom level 0.
Onwards, the more we zoom the deeper we go in the quadtree. For example, the Sather Tower at UC Berkeley (a great place to be at sunset on a nice summer day) is located, at level 18, in the node 42046/101234. Below is the corresponding cell taken again from openstreetmap
More on this can be found on openstreetmap's wiki page.
Let's take a minute for a word on projections.
We already mentionned that we wish to represent the Earth as a square for conveniance. As you all know, in real life, it's closer to a sphere, so we need a way of going from one representation to the other.
Most are familiar with the latitude/longitude coordinate system which is most often mentionned in degrees (see Geographic_coordinate_system@wikipedia). The UC Berkeley Sather Tower has coordinates latitude 37.872063° and longitude -122.257839° in decimal degrees (which can also be represented as latitude N37°52'19" and longitude W122°15'28" in the degrees/minutes/seconds representation). In degrees latitudes vary from -90° (South Pole) to +90° (North Pole) and longitudes from -180° to +180° (both being in fact the same meridian sometimes called the dateline), with 0° the Greenwich Meridian in England (where that dubious GMT timezone lies).
To flatten out the Earth to the map friendly representation that is the square, we apply a mathematical transformation known as the Mercator projection (Mercator_projection@wikipedia). This is actually the way most maps are made, so this is in fact the way a lot of you are used to seeing things layed out on road maps (if you still use them... :) or on your GPS. Without going into details, an interesting fact on this projection is that it doesn't conserve sizes. That's why Alaska appears to be twice the size it actually is on a map of North America. This can be seen when you compare the two following images. On the left (taken from Alaska@wikipedia) Alaska is overlayed to the 48 contiguous states of the USA, everything here being in correct relative sizes. On the right (extracted from openstreetmap) a montage compares again Alaska and the contiguous states as they are shown in a Mercator projection. In the left image you can see that Alaska fits nicely within the layout of continental USA whereas in the right image the "Last Frontier" state seems to be about twice it's real size.
Furthermore, there are two important facts one needs to remember about this projection. The first is that to make a square out of the sphere that is Earth, we must restrict latitudes to the range [-85.0511,85.0511] (in degrees), luckily, besides the poles, not much lies outside this range. The second fact is that the points on either side of the dateline which in reality are very close, are sent to the two opposing sides of the square, this will have to be dealt with accordingly. Coordinates in both directions now lie in the range [-π, π] after transformation .
From this projection, we naturally obtain the first level of the quadtree as the 4 parts divided by the equator and the Greenwich meridian (e.g. the four images introduced earlier and taken from openstreetmap to represent zoom level 1). Successively dividing each part into four equal parts we obtain the gist of the quadtree representation of the Earth. There is a natural numbering scheme that arises from this description when you notice that at level 'n' there are (2n-1) divisions in each direction. Therefore a single cell can be represented by a 3-uple (n,i,j) where 'n' represents the zoom level and the integers (i,j) range from (0,0) to (2n-1, 2n-1) and represent the coordinates of the cell.
Furthermore, a useful property of this grid description, is that passing from the cell with coordinates (n,i,j) at level n to the coordinates of the included sub-cells is immediate. This correspondence between cell c(n,i,j) and the coordinates of its sub-cells at level n+1 is summarized in the following table as
| (n+1,2i,2j) | (n+1,2i+1,2j) |
| (n+1,2i,2j+1) | (n+1,2i+1,2j+1) |
which will be very useful in the next section.
As a final remark in this section, at level 25 a cell from the quadtree representation of the Earth has a width of about 1 meter. It is then easy to imagine how points can be approximated within this description.
As a number of usage cases of the quadtree, we rely on a description similar to that of the Morton number description, but we will not go into details on our particular methodology for this.
Using the quadtree to describe an area
At this point we have described what we use, now let's move on to the how. How does this help query points ?
Starting easy, we consider a rectangular portion 'R' of the map, that is 4 points with 4 independant coordinates taken in a certain order. The idea is then to find a set of the quadtree cells that best summarizes this area. Once we have made this spatial summary of our zone, we will have a set of cells within which all points of the rectangle lie. This information is then passed on in the form of a query on indexed cell data which will answer our initial question.
As the first step in this process, we need to decide what size cells we are going to be using. To do this, we rely on the definition of the quadtree where we notice that the zoom level 'n' and the extent of an individual cell at this zoom level in the Mercator projection 'e' are linked by
Extending this definition to our needs, we then define a float 'z' from the size 's = max(width, height)' of the rectangular selection 'R' we made earlier (in Mercator projection) as
If the rectangle crosses the dateline, we simply divide the original rectangle into to two smaller ones, each lying on either side of the dateline and apply the steps individually to each one.
This float 'z' can be seen as an extension of the idea of zoom level to estimate the size of a rectangle (and can be extended to any shape). Once this value has been calculated we define a new integer 'nmax = floor(z) + u', where, by empirical decision, we usually take 'u = 3'.
This gives us 'nmax', the level of the smallest cells we are going to be using. Once this is chosen we construct the set '{C}' of cells at level 'nmax' that minimum bounds the rectangle (i.e. it is the smallest set that entirely contains the whole rectangle). We could make the decision to stop here and use this set '{C}', but the cost of passing long sets of cells down the line by far outweighs the cost of the next step.
Therefore, this next step, which we call the "remapping algorithm", takes the set '{C}' and tries to build the smallest set of cells that occupy the same portion of the map as '{C}'.
This algorithm employs the sub-cell coordinate property of the quadtree described earlier, but in reverse. That is, if we have four adjacent cells in '{C}': c(n,i,j), c(n,i+1,j), c(n,i,j+1) and c(n,i+1,j+1) where both i and j are even then we can replace these 4 cells with the single cell c(n-1,i/2,j/2).
Applying this over the whole set and recursively, we move down the tree and we stop when we have finished with level 'nmin = floor(z)'. We then return the final set '{C}'. This final set usually has quite a number fewer elements than the first, thus resulting in our desired outcome. Below is an example of this applied to a rectangle that surrounds the city of Paris, France (overlayed to google map using the javascript API).
Finally, we extend this to apply it to any shape, not just rectangles.The only thing that needs to be modified is the way the initial set '{C}' is chosen.
We start with a single closed polygon description of a region, for example the city limits of Paris. From the minimum bounding box of this geometry (i.e. the rectangle that best fits this geometry), we calculate its zoom level in the same fashion as with the procedure applied to the simple rectangle. This gives us 'z' and again we calculate the integer 'nmax = floor(z) + u'.
Empirically, we have noticed we need a more precise fit than in the simple version. We usually take 'u = 7' as a good trade-off between precision, compaction of resulting cell set and computation time in this more precise version. The polygon description we have is often a summary of the complete contour, therefore we use Bresenham's line algorithm (the algorithm@wikipedia) to complete it as if we were dealing with a raster. This contour is then filled using the winding number algorithm (a description can be found here), this results in the initial set of cells '{C}' which is passed to the same remapping algorithm. The resulting final set, again for Paris, is shown bellow (again overlayed to google map).
The syntax and form of the query itself and the way data is stored in the index are outside the scope of this post. Suffice it to say that there are a few more stations on the assembly line than the ones described here. Let's not give away all our secrets in a single blog post... :)
In conclusion, by using this spatial description system, we have an efficient way of obtaining all points within a certain zone. It also allows for indexed "fuzzy" matching between points and geometries or between geometries. That is, it's an efficient way of spatially indexing data with a chosen un-precision. In reverse, it has potential for an interesting geoquery system. Indeed, by taking advantage of the "fuzzy" matching, we can apply distance filtering at minimum cost.
Comments