Oscar resells MagnaCare insurance, and they're not even the only ones on the NY exchange to do it. I wish they'd actually do something different in the insurance market, but Oscar is more of the same.
pcre_exec()'s length and offset parameters are ints, so there's not much I can do about files over 2GB. I really don't want to split the file into chunks and deal with matches across boundaries. That's just asking for bugs. I guess I could make literal string searches work, at least on 64-bit platforms.
Honestly though, I don't think ag is the right tool for that job. For a single huge file, grep is going to be the same speed. Possibly faster, since grep's strstr() has been optimized for longer than I've been alive.
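That said, the 64-bit literal search I mentioned would be pretty simple: mmap the whole file and walk it with memmem() using size_t offsets, so the 2GB limit stops mattering. A rough sketch, not actual ag code, and memmem() is a GNU/BSD extension:

    /* Sketch only: a 64-bit literal search over a file of any size,
     * sidestepping pcre_exec()'s int-sized length/offset parameters. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void search_file(const char *path, const char *needle) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return; }

        struct stat sb;
        if (fstat(fd, &sb) < 0 || sb.st_size == 0) { close(fd); return; }

        char *buf = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); close(fd); return; }

        size_t nlen = strlen(needle);
        size_t remaining = sb.st_size;   /* size_t offsets: >2GB is fine on 64-bit */
        const char *p = buf;

        while (remaining >= nlen) {
            const char *hit = memmem(p, remaining, needle, nlen);
            if (!hit) break;
            printf("%s: match at byte offset %zu\n", path, (size_t)(hit - buf));
            remaining -= (hit - p) + nlen;
            p = hit + nlen;
        }

        munmap(buf, sb.st_size);
        close(fd);
    }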
I gave some thought to the right tool for the job of searching DNA.
DNA files don't change very often, which makes building an index worthwhile. Apparently, sequencing isn't perfect and neither are cells, so you'd want fuzzy matching. But repeats in DNA are also common, so that means fuzzy regex matching. There is already a fuzzy regex library[1], but I have no idea how fast it is. If the application requires performance above everything, an n-gram index sounds like the right tool for the job.
After writing the paragraph above, I searched for "DNA n-gram search." The original n-gram paper from 2006 used DNA sequences in its test corpus.[2] I don't know much about DNA or the applications built around it, so I'm glad I managed to recommend a tool that was designed for the job.
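Out of curiosity, here's roughly what a toy k-mer (n-gram) index looks like. Everything in it — k=8, the 2-bit encoding, the exact-match verification with memcmp — is an arbitrary choice for illustration; a real fuzzy search would verify candidates with an edit-distance check instead:

    /* Toy k-mer (n-gram) index over a DNA string; illustration only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define K 8                         /* k-mer length; 4^8 = 65536 buckets */
    #define NBUCKETS (1 << (2 * K))

    typedef struct { size_t *pos; size_t len, cap; } bucket_t;
    static bucket_t buckets[NBUCKETS];

    static int base_code(char c) {      /* A=0 C=1 G=2 T=3, -1 otherwise */
        switch (c) {
            case 'A': return 0; case 'C': return 1;
            case 'G': return 2; case 'T': return 3;
            default:  return -1;
        }
    }

    /* Encode the k-mer starting at s, or return -1 if it has a non-ACGT byte. */
    static long kmer_code(const char *s) {
        long code = 0;
        for (int i = 0; i < K; i++) {
            int b = base_code(s[i]);
            if (b < 0) return -1;
            code = (code << 2) | b;
        }
        return code;
    }

    static void index_sequence(const char *seq) {
        size_t n = strlen(seq);
        for (size_t i = 0; n >= K && i + K <= n; i++) {
            long code = kmer_code(seq + i);
            if (code < 0) continue;
            bucket_t *b = &buckets[code];
            if (b->len == b->cap) {
                b->cap = b->cap ? b->cap * 2 : 4;
                b->pos = realloc(b->pos, b->cap * sizeof(*b->pos));
            }
            b->pos[b->len++] = i;       /* record every position of this k-mer */
        }
    }

    /* Exact search: look up the pattern's first k-mer, then verify candidates. */
    static void query(const char *seq, const char *pattern) {
        size_t n = strlen(seq), plen = strlen(pattern);
        if (plen < K) return;
        long code = kmer_code(pattern);
        if (code < 0) return;
        bucket_t *b = &buckets[code];
        for (size_t i = 0; i < b->len; i++) {
            size_t at = b->pos[i];
            if (at + plen <= n && memcmp(seq + at, pattern, plen) == 0)
                printf("match at %zu\n", at);
        }
    }

    int main(void) {
        const char *seq = "ACGTACGTACGGTTACGTACGTTT";
        index_sequence(seq);
        query(seq, "ACGTACGTAC");       /* prints: match at 0 */
        return 0;
    }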
I built ag for myself; both as a tool and to improve my skills profiling, benchmarking, and optimizing. Had I known how popular it would become, I would have definitely held myself to a higher standard, or any standard. Most importantly, I'd have written tests. These days, I'm busy with a startup so progress on those fronts has been slow.
ag is incredible, especially paired with Ack.vim and a mapping. I use <leader>as to search for the current word under the cursor. The results are instantaneous. With ag and YouCompleteMe, I never fall back to cscope/ctags in C++ projects anymore.
One thing, though: it skips certain source files seemingly arbitrarily without the -t param, and I haven't figured out why... It doesn't seem related to any .gitignore entries I've been able to identify.
The silver searcher is pretty good, but it has a couple of big problems. It does not parse the .gitignore correctly [0], so it frequently searches files that are not committed to your repo. This, combined with the decision to print 10,000-character-long lines, means a lot of search results are useless.
I noticed the issue you mentioned, but as the last comment on it notes, I believe this has already been fixed. My specific case, at least, was resolved by updating from master.
One thing I miss a little is that ack has the super convenient:
ack --java "foo"
while with ag you write:
ag -G"\.java$" "foo"
But yes, ack and ag feel pretty identical except for the speed. Most of the time the speed improvement is irrelevant to me, except sometimes now I'll use ag in my home folder, and it's still fairly snappy.
That was too much typing anyway. When you mostly work with one language, something like this is nice (in my case C/C++):
alias ack-cpp='ack-grep --type=cpp --type=cc'
Hm, I've recently begun using zsh primarily and this trick doesn't work there: zsh lets you know what the alias is... bash will happily find `rack` in your `$PATH` and then run it.
(Presumably because in zsh, `which which` says it's a shell built-in, whereas in bash it finds `/usr/bin/which`, so bash doesn't seem to care about your aliases.)
I normally tell people to use ack because it's like grep but faster (owing to its sensible defaults) ... if I use this I'm worried I might go too fast and travel backwards in time or something.
In my benchmarking, mmap() was about 20% faster than read() on OS X, but the same speed on Ubuntu. Pretty much everything else in the list (pthreads, JIT regex compiler, Boyer-Moore-Horspool strstr(), etc) improves performance more than mmap().
Also, mmap() has the disadvantage that it can segfault your process if something else makes the underlying file smaller. In fact, there have been kernel bugs related to separate processes mmapping and truncating the same file.[1] I mostly use mmap() because my primary computer is a Mac.
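For the curious, the JIT regex compiler bit is PCRE's pcre_study() with PCRE_STUDY_JIT_COMPILE. A simplified sketch, not the exact code in ag — and note that pcre_exec()'s length parameter really is an int, which is where the 2GB limit mentioned above comes from:

    /* Sketch of the PCRE1 JIT path; compile with -lpcre. */
    #include <pcre.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *pattern = "wor(l)d";
        const char *subject = "hello world";
        const char *err;
        int erroffset;

        pcre *re = pcre_compile(pattern, 0, &err, &erroffset, NULL);
        if (!re) {
            fprintf(stderr, "compile error at %d: %s\n", erroffset, err);
            return 1;
        }

        /* pcre_study() with PCRE_STUDY_JIT_COMPILE is what turns on the JIT. */
        pcre_extra *extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &err);

        int ovector[30];
        int rc = pcre_exec(re, extra, subject, (int)strlen(subject),
                           0, 0, ovector, 30);
        if (rc >= 0)
            printf("match at byte %d\n", ovector[0]);

        if (extra) pcre_free_study(extra);
        pcre_free(re);
        return 0;
    }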
Now I'm burning with curiosity. I have to know why! My plan:
- replicate the experiment, confirm --mmap shaves off a non-negligible amount of time. It could be that his computer happened to be running something in the background that was using his hard drive, for example, which would skew the results.
- look at the code, figure out the exact difference between what --mmap is doing and what it does by default. Confirm that the problem isn't in grep itself (it's probably not, but it's important to check).
- dig into the kernel source to figure out the difference under the hood and why it might be faster.
I wonder if it has to do with not having to copy data back and forth between kernel and userspace. My mildly uneducated thought is that you could do this with splice() or whatever, but mmap is an easy drop-in replacement.
edit: I've been reading your posts for a while and I like them, but I keep wondering, why do you have sillysaurus1-2-3?
That's what has me so curious, because it doesn't seem like copying between kernel/userspace should account for a 20% speed drop. Once data is in the L3 CPU cache, it should be inexpensive to move it around.
Regarding my ancestry, I'm sillysaurus3 because I've (rightfully) been in trouble twice with the mods for getting too personal on HN. I apologized and changed my behavior accordingly, and additionally created a new account both times to serve as a constant reminder to be objective and emotionless. There's rarely a reason to argue with a person rather than with an idea. Debating ideas, not people, has a bunch of nice benefits: it's easier to learn from your mistakes, it makes for better reading, etc. It's pretty important, because forgetting that principle leads to exchanges like https://news.ycombinator.com/item?id=7700145
Another nice benefit of creating a new account is that you lose your downvoting privilege for a time, which made me more thoughtful about whether a downvote is actually justified.
Possibly the OS is doing interesting things with file access and caching, and opting out of that has benefits for this particular workload?
...
I just skimmed the BSD mailing-list email on why grep is fast that was linked up-thread, and it seems that's somewhat the case. It sounds like, since they do advanced search techniques on what matches or can match, they use mmap to avoid requiring the kernel to copy every byte into memory when they know they only need to look at specific ranges of bytes in some instances. At least that was the case at some point in the past.
Finally, when I was last the maintainer of GNU grep (15+ years ago...), GNU grep also tried very hard to set things up so that the _kernel_ could ALSO avoid handling every byte of the input, by using mmap() instead of read() for file input. At the time, using read() caused most Unix versions to do extra copying.
P.S. Nice attitude, it earned an upvote from me. Which is probably one reason why your third account has more karma than my first.
Right, I think the point of Boyer-Moore is that it lets you skip large chunks of the text during the search.
So the assumption is that those pages never even get paged in, but I think that would only be the case when the pattern size is at least as large as the page size (usually 4KB!), which isn't the case in the example in the mailing list. So the mystery continues!
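To make the skipping concrete, a bare-bones Boyer-Moore-Horspool looks like this (illustration only, not grep's or ag's implementation). The jumps are bounded by the pattern length, which is tiny next to a 4KB page, so every mapped page still gets faulted in even though most bytes are never compared:

    /* Bare-bones Boyer-Moore-Horspool: on a mismatch, the bad-character
     * table lets the search jump forward by up to the pattern length. */
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    static const char *bmh_search(const char *hay, size_t hlen,
                                  const char *needle, size_t nlen) {
        if (nlen == 0 || hlen < nlen) return NULL;

        size_t skip[256];
        for (size_t i = 0; i < 256; i++) skip[i] = nlen;   /* default: full jump */
        for (size_t i = 0; i + 1 < nlen; i++)
            skip[(unsigned char)needle[i]] = nlen - 1 - i;

        size_t pos = 0;
        while (pos + nlen <= hlen) {
            if (memcmp(hay + pos, needle, nlen) == 0)
                return hay + pos;
            /* Jump based on the haystack byte aligned with the pattern's last char. */
            pos += skip[(unsigned char)hay[pos + nlen - 1]];
        }
        return NULL;
    }

    int main(void) {
        const char *hay = "the quick brown fox jumps over the lazy dog";
        const char *hit = bmh_search(hay, strlen(hay), "lazy", 4);
        printf("%s\n", hit ? hit : "no match");   /* prints "lazy dog" */
        return 0;
    }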
The last time I had to do fast, large sequential disk reads on Linux, it was surprisingly complex to get all the buffering/caching/locking to not do the wrong thing and slow me down a lot. I wouldn't be surprised if non-optimized mmap() is a whole lot faster than non-optimized use of high-level file I/O libraries.
If anything, that post is evidence of how tricky optimization is, and how easy it is to fool yourself about what matters. It's probably best to be skeptical about mmap() as a performance optimization over reading into a buffer unless evidence demonstrates otherwise. Most OSes do a pretty good job of caching at the filesystem level, and under the hood paging is essentially reading into a buffer anyway. mmap() might make the code simpler, but it's hard to imagine it makes it faster. If it does, I'd like to understand why.
So are we talking about constant-time optimization, then? I.e. it shaves off a few milliseconds regardless of how complex the search is, or how many files it's reading, or how large each file is. I'll happily concede that mmap() might do that. But a performance boost linear w.r.t. search complexity/number of files/filesize? Hard to believe, and I should go measure it myself to prove the point or learn why I'm mistaken.
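The measurement itself can be tiny: time a read() loop against an mmap() scan over the same file, with newline counting standing in for the real work. A sketch — run it in both orders a few times, since whichever pass goes first warms the page cache and skews the comparison:

    /* Minimal read() vs mmap() timing harness; usage: ./a.out <file> */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    static size_t count_read(int fd) {
        static char buf[1 << 20];          /* 1 MiB read buffer */
        size_t lines = 0;
        ssize_t n;
        lseek(fd, 0, SEEK_SET);
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            for (ssize_t i = 0; i < n; i++)
                lines += (buf[i] == '\n');
        return lines;
    }

    static size_t count_mmap(int fd, size_t size) {
        char *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 0;
        size_t lines = 0;
        for (size_t i = 0; i < size; i++)
            lines += (p[i] == '\n');
        munmap(p, size);
        return lines;
    }

    int main(int argc, char **argv) {
        if (argc != 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat sb;
        if (fd < 0 || fstat(fd, &sb) < 0 || sb.st_size == 0) return 1;

        double t0 = now();
        size_t a = count_read(fd);
        double t1 = now();
        size_t b = count_mmap(fd, sb.st_size);
        double t2 = now();

        printf("read(): %zu lines in %.3fs, mmap(): %zu lines in %.3fs\n",
               a, t1 - t0, b, t2 - t1);
        close(fd);
        return 0;
    }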
Constant-time improvements are still improvements, especially if they're in an inner loop. Otherwise we would all be using Python and just writing great algorithms.
I'm assuming (and this could be a bad assumption) that, based on the continuing-updates architecture, if client A changes model A, client B will see an update on model A. How does client B get notified of the change? Does it have some kind of fallback system a la socket.io? Is this not yet part of the project?
Judging by the chart on the site, it looks like each client polls the original model on the server, and infers changes by diffing it against the client's current copy.
While OTs (there are several variants) are a promising approach for distributed authoring, I think the complexity of implementing them is still prohibitive. Surely there is a better way...
I have been reading papers on this, looking for a "clean" way to solve it (in the context of packet loss, latency, etc.).
There are other approaches out there. One example worth checking out, which uses character-based changes, is here:
PAPER wikisym.org/ws2010/tiki-download_wiki_attachment.php?attId=15
CODE https://github.com/gritzko/ctre
I feel like it is time someone solved this collaborative-editing thing once and for all and shared the code with everyone. (Firepad? https://github.com/firebase/firepad/)
I wrote a simulator to help me understand Differential Sync. The nice part about DS, unlike OT, is that you can still work offline and sync later. Here is the simulator:
Yup, I've actually read all those papers, and then some. I think OT is often misunderstood and poorly documented, and really needs a solid go-to library. (Along with clear, accessible documentation.)
ShareJS and "Operational-Transformation" (used by Firepad) are decent, but in my opinion aren't general enough and may even have some algorithmic flaws.
And yes, it's time this was solved, so the basics don't have to be re-implemented for every collaborative project.
http://sharejs.org/ is in use for a number of production sites, and we've had users report that it is quite reliable. How would you like to see it further generalized?
OT works naturally with linear data structures, but is tougher to use correctly with more complex data, which Mozilla seems to have here with Towtruck.
No approach to technology architecture is a "silver bullet," especially with a stack spread across client and server.
I always find myself returning to the "boring, old-fashioned" way of doing things (like server-side processing or relational databases) and these do indeed seem like the best choices for many applications.
But I don't see how this calls client-side MVC into question, except that perhaps it's considered as a default choice too often.
While looking beyond a pg_dump-style approach to backup/recovery, I was considering https://github.com/heroku/WAL-E and discovered Barman. It's also open source, and looks like a strong contender.