20 Apr2019
In 2011-2012 I dealt with FIFA Street, followed by FIFA EURO 2012 DLC and finally FIFA 13 – all of these games were based on the exact same codebase, and this codebase was HUGE. Offered an unidentified codebase, you need a way to rapidly get around it – since you do not know the code, you resort to search-based navigation, aka grep. Using Visual Studio Ctrl Shift F search on a HDD on a codebase this size suggests that every search takes minutes. This was frustrating and as such I chose to solve this issue.
I wanted a tool that was much faster than existing options 1 and was able to carry out both actual and regular expression searches at comparable speed. At the time, the only existing tool that provided near-instantaneous outcomes was Google Code Search – sadly, the efficiency on case-insensitive queries or some kinds of routine expressions at the time wasn’t great 2 Thus I chose to make a new tool, qgrep 3 This short article will review the design and efficiency optimizations that make qgrep really quickly.
How quickly are we discussing?
Remember, qgrep was composed in 2012, so the competition at the time was grep, ag, Visual Studio Ctrl Shift F, and the hardware target was a multi-core CPU with an HDD. Today the golden standard of efficiency is set by ripgrep, which runs much faster than the options, and the standard is to utilize an SSD. Still, qgrep is much much faster.
As an example, let’s try to look for an easy query, vkCmdDrawInde. * KHR = 0
, in UE4 codebase which is similar in size today to FIFA codebase from2012 We’ll run qgrep and ripgrep twice: very first time seeks a reboot so the filesystem cache is cold, and 2nd time is right away after that. The timings are done using my desktop system with i7 8700 K (6 cores 12 threads) and Samsung 970 EVO SSD (which is an M. 2 NVMe drive) on Windows 10, utilizing most current ripgrep/qgrep x64 builds. Here’s the hot run:
C: work qgrep > build Release_x64 qgrep search ue4 S "vkCmdDrawInde. * KHR = 0".
C:/ work/unrealengine/Engine/ Source/ThirdParty/Vulkan/ Include/vulkan/vulkan. hpp: 44478: PFN_vkCmdDrawIndexedIndirectCountKHR vkCmdDrawIndexedIndirectCountKHR = 0;.
Browse total, discovered 1 matches in 0.03 sec.
C: work unrealengine > rg-- stats "vkCmdDrawInde. * KHR = 0".
Engine Source ThirdParty Vulkan Include vulkan vulkan.hpp.
53924: PFN_vkCmdDrawIndexedIndirectCountKHR vkCmdDrawIndexedIndirectCountKHR = 0;.
1 matches.
1 matched lines.
1 files included matches.
118457 files searched.
143 bytes printed.
1237500936 bytes searched.
3.651454 seconds spent searching.
1.216025 seconds.
And here’s the table of timings in seconds; here we likewise run qgrep with a b
option that will be explained later (it represents bruteforce
):
tool | cold | hot |
---|---|---|
qgrep | 0.42 | 0.03 |
qgrep bruteforce | 0.66 | 0.08 |
ripgrep | 102 | 1.2 |
It’s possible that if in 2012 I utilized SSDs and ripgrep was offered, I would not have gone through the problem of making a brand-new tool – however the discomfort threshold was crossed so I did and here we are. However how is it possible to make a code search tool that’s a lot faster than ripgrep? Why, by cheating, of course.
Removing I/O bottlenecks
The single biggest source of efficiency problems at the time was disk I/O and I/O-related system calls. A recursive grep tool should begin by recursively passing through the target folder; as brand-new files are come across, it needs to read their contents and perform regular or actual search on the results, and print the matches if any. In theory, the filesystem in-memory cache is expected to make subsequent searches fast; on FIFA codebase, nevertheless, both file hierarchy and file contents was never completely in the filesystem cache, especially as you worked on the code (for example, changing to the internet browser to google something would likely evict big parts of the search data set). Even when filesystem hierarchy was in the cache, obtaining the hierarchy wasn’t really efficient either due to a large number of kernel operations required. As you can see by looking at ripgrep results, the impact is very significant even on SSDs; on HDDs we’re talking about waiting on a minute for the search to complete.
To navigate these problems, qgrep maintains a compressed representation of the whole searchable codebase. It’s stored in one file that includes chunks; each portion consists of a list of file courses and file data, both compressed utilizing LZ4. Files are contributed to the present chunk up till the uncompressed size reaches a specific threshold (512 KB at the minute) and after that the chunk is compressed and a brand-new chunk is started. If a file is huge (which is in fact the case for vulkan.hpp
at 2.2 MB), it’s broken up into a number of chunks utilizing newlines as a splitting point which works well due to the fact that the search is line based.
Chunks are compressed and decompressed atomically in the sense that you can’t update just one file in the portion – compressor gets a blob that includes contents of all files concatenated together. Portions that are too little reduce the efficiency of compression; pieces that are too big make incremental updates and parallel decompression less efficient.
Having a file which contains compressed representation of all input information suggests that even if the file is not in the file cache, reading this is as quick as possible since it’s frequently stored contiguously on disk. Compression is important due to the fact that HDD read speeds are quite sluggish compared to quick decompressors, and compressing boosts the opportunity the files will remain in cache.
One other huge advantage is that files can be pre-filtered during indexing to eliminate files that are not intriguing to search; qgrep by default consists of the majority of text files using file extension as a filter. This implies that large folders with, say, construct artifacts, only affect the time it takes to update the information, and don’t effect search performance. On UE4 example above, ripgrep browse 1.2 GB of file data, and qgrep just browses 1.0 GB (compressed to 212 MB).
Why LZ4?
An essential objective when picking the compression algorithm was to have decompression run at similar efficiency to the search itself (which, as you will see quickly, can be actually effective). While it may seem that having a slower decompressor suffices as long as it’s much faster than the disk, when the file remains in cache the disk I/O isn’t an aspect so the much faster the decompressor is, the much better.
At the time a strong option for having extremely fast decompression efficiency was LZ4; today Zstd offers a fascinating option that I haven’t evaluated carefully for this usage case. Stylish was an alternative available at the time however it had slower decompression, and didn’t have a high-compression choice. LZ4 likewise made a lot of progress for many years; numerous updates to lz4 since 2012 enhanced search efficiency on decompression-heavy inquiries by ~50%in total.
When qgrep was made at first, all focus was on search efficiency, and the time it required to update the qgrep information wasn’t that important. Due to the fact that of this, LZ4HC was best for the task – at the cost of spending more time when compressing the information, it gave much better compression ratios. In later releases a few modifications were made to deal with the slow compression efficiency:
- Portion compression was relocated to a separate thread so that it might run in parallel with reading files off disk and carrying out newline conversion and the like, and writing resulting piece data to disk
- A lower compression level was utilized (which is a relatively recent addition to LZ4HC) to compromise a little bit of compression ratio for a lot of compression time.
- The default method to update qgrep data is now incremental – if no files inside a chunk changed and the piece sizes don’t require to be rebalanced to preserve an affordable typical size, no recompression is performed
As an outcome, it takes 6.5 seconds with hot file cache to rebuild UE4 search information from scratch and 1.2 seconds to update a single file in it; in the latter case the majority of the time is spent passing through file directory structure.
Multi-threading
Naturally, qgrep attempts to multithread the search procedure. There’s a single thread that checks out pieces off disk, and dispatches them to a thread swimming pool. The threads in the pool handle decompression of piece information and the search itself.
Several essential issues develop throughout this procedure.
Initially, the read speed and search speed can be substantially out of balance. For instance, if the information is in file cache, reading it from disk performs at memory speed, whereas decompression and search are typically slower. If the data is actually large, it could be the case that the overall memory effect of the portion line is significant. Due to the fact that of this, qgrep extensively utilizes a thread-safe obstructing queue that has an unique twist – when the queue is produced, it’s being told just how much overall memory the queue must hold at many. Whenever a product is pressed to the queue, if the product’s memory effect surpasses the budget, the thread that presses the item waits on “area” to appear. This makes certain that we can browse arbitrarily large datasets with limited memory.
2nd, search threads will produce lead to an arbitrary order – each thread will produce search outcomes in each piece in order, however we want to display arise from different chunks to the user in the specific same order that a single thread would. This indicates we require to buffer the output results – nevertheless, for some search queries the size of that buffer can be arbitrarily large. To resolve this problem, qgrep utilizes a special purpose bought queue – the method it works is that the producers (search threads) submit items to that queue with an index showing the initial piece index, and the single consumer that prints the output processes items in the order suggested by this index. For instance, if 3 threads finish pieces 1,2,3 rapidly and the 4th thread takes a while to end up piece 0, the output for chunks 1,2,3 will remain in memory till chunk 0 is processed and added to the line, after which the consumer thread can process output with indices 0,1,2,3 in order.
3rd, a lot of care requirements to be required to optimize memory allocations. A big source of efficiency concerns can be assigning and deallocating large blocks of memory due to the fact that on lots of systems doing so will repeatedly release the memory to the system, assign it again, and trigger lots of page faults that can lead to serializing execution, as explained in my older post This issue is solved by utilizing a pool for big allotments (which is somewhat counter instinctive considering that it’s more typical to pool small allowances than large ones).
Finally, for big matches, the procedure of highlighting the match results might take rather a bit of time – once we understand there’s a match in a given line, we need to discover the bounds of this match, and create suitable markup for the terminal output with colors. Because of this, highlighting runs in the search thread and the output portions consist of final formatted output so that the single output consumer thread only needs to print the outcomes to the terminal.
Regular expression search
After decompressing chunk contents, we utilize a regular expression engine to find the match. The engine that qgrep uses is RE2; the design and performance characteristics are described here My understanding is that the style of ripgrep is comparable, but I’m not an expert on the matter – I profiled RE2 in 2012 and it was faster than other engines I tested. Ever since I’ve utilized RE2 in some other applications and the only engine I’ve seen that might beat RE2 depending on the task was PCRE JIT, which wasn’t readily available in 2012 (and for qgrep specifically I’m not sure if the JIT time will pay for the increased performance considering that we’re interested in end-to-end time of the query and aren’t able to cache the JIT code).
The search is done on a line by line basis, however rather of feeding each line to the routine expression engine at a time, the regular expression is ran on the whole file at a time; whenever there’s a match, we output that match and leap to the beginning of the next line to continue the search. This leads to ideal performance as the expense of preparing the match data is lessened, and the numerous scanning optimizations like memchr
mentioned listed below can genuinely shine.
One notable weak point of RE2 is case-insensitive searches; it’s a basic residential or commercial property of other routine expression engines also – case insensitive matches produce far more complex state makers; additionally one reason why RE2 is quick is it uses memchr
when trying to scan for an actual submatch of a routine expression (in the test question from this post this occurs for ‘v’ and ‘K’) rather of putting one character through the automata at a time, and case insensitive searches invalidate this. Since case insensitive searches are very common, qgrep takes a shortcut and presumes that case insensitive searches just require to handle ASCII – when that presumption holds, we can transform both the regular expression and file contents to ASCII lower case. For file data, performance of this transform is vital so we use SSE2 optimized casefold function with the basic loop that processes 16 bytes at a time:
// Shift 'A'.' Z' variety ([65..90]) to [102..127] to use one signed contrast insn.
__ m128 i shiftAmount =-LRB- *****) _ mm_set1_epi8(127-' Z');-LRB- *****).
__ m128 i lowerBound =-LRB- *****) _ mm_set1_epi8(127-(' Z'-' A')- 1);-LRB- *****).
__ m128 i upperBit =-LRB- *****) _ mm_set1_epi8( 0x20);-LRB- *****).
__ m128 i v =-LRB- *****) _ mm_loadu_si128( reinterpret_cast<const __m128i*> ( i));-LRB- *****). __ m128 i upperMask =-LRB- *****) _ mm_cmpgt_epi8( _ mm_add_epi8( v, shiftAmount), lowerBound);-LRB- *****). __ m128 i cfv =-LRB- *****) _ mm_or_si128( v,<__m128i*>_ mm_and_si128( upperMask, upperBit));-LRB- *****). _ mm_storeu_si128( reinterpret_cast( dest ), cfv);-LRB- *****). dest =-LRB- *****)16;-LRB- *****).
Which is an ethical equivalent of computing unsigned( ch -' A')