Oscar resells MagnaCare insurance, and they're not even the only ones on the NY exchange to do it. I wish they'd actually do something different in the insurance market, but Oscar is more of the same.
pcre_exec()'s length and offset parameters are ints, so there's not much I can do about files over 2GB. I really don't want to split the file into chunks and deal with matches across boundaries. That's just asking for bugs. I guess I could make literal string searches work, at least on 64-bit platforms.
Honestly though, I don't think ag is the right tool for that job. For a single huge file, grep is going to be the same speed. Possibly faster, since grep's strstr() has been optimized for longer than I've been alive.
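That said, the 64-bit literal search I mentioned would be pretty simple: mmap the whole file and walk it with memmem() using size_t offsets, so the 2GB limit stops mattering. A rough sketch, not actual ag code, and memmem() is a GNU/BSD extension:

    /* Sketch only: a 64-bit literal search over a file of any size,
     * sidestepping pcre_exec()'s int-sized length/offset parameters. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void search_file(const char *path, const char *needle) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return; }

        struct stat sb;
        if (fstat(fd, &sb) < 0 || sb.st_size == 0) { close(fd); return; }

        char *buf = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); close(fd); return; }

        size_t nlen = strlen(needle);
        size_t remaining = sb.st_size;   /* size_t offsets: >2GB is fine on 64-bit */
        const char *p = buf;

        while (remaining >= nlen) {
            const char *hit = memmem(p, remaining, needle, nlen);
            if (!hit) break;
            printf("%s: match at byte offset %zu\n", path, (size_t)(hit - buf));
            remaining -= (hit - p) + nlen;
            p = hit + nlen;
        }

        munmap(buf, sb.st_size);
        close(fd);
    }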
I gave some thought to the right tool for the job of searching DNA.
DNA files don't change very often, which makes building an index worthwhile. Apparently, sequencing isn't perfect and neither are cells, so you'd want fuzzy matching. But repeats in DNA are also common, so that means fuzzy regex matching. There is already a fuzzy regex library[1], but I have no idea how fast it is. If the application requires performance above everything, an n-gram index sounds like the right tool for the job.
After writing the paragraph above, I searched for "DNA n-gram search." The original n-gram paper from 2006 used DNA sequences in its test corpus.[2] I don't know much about DNA or the applications built around it, so I'm glad I managed to recommend a tool that was designed for the job.
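Out of curiosity, here's roughly what a toy k-mer (n-gram) index looks like. Everything in it — k=8, the 2-bit encoding, the exact-match verification with memcmp — is an arbitrary choice for illustration; a real fuzzy search would verify candidates with an edit-distance check instead:

    /* Toy k-mer (n-gram) index over a DNA string; illustration only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define K 8                         /* k-mer length; 4^8 = 65536 buckets */
    #define NBUCKETS (1 << (2 * K))

    typedef struct { size_t *pos; size_t len, cap; } bucket_t;
    static bucket_t buckets[NBUCKETS];

    static int base_code(char c) {      /* A=0 C=1 G=2 T=3, -1 otherwise */
        switch (c) {
            case 'A': return 0; case 'C': return 1;
            case 'G': return 2; case 'T': return 3;
            default:  return -1;
        }
    }

    /* Encode the k-mer starting at s, or return -1 if it has a non-ACGT byte. */
    static long kmer_code(const char *s) {
        long code = 0;
        for (int i = 0; i < K; i++) {
            int b = base_code(s[i]);
            if (b < 0) return -1;
            code = (code << 2) | b;
        }
        return code;
    }

    static void index_sequence(const char *seq) {
        size_t n = strlen(seq);
        for (size_t i = 0; n >= K && i + K <= n; i++) {
            long code = kmer_code(seq + i);
            if (code < 0) continue;
            bucket_t *b = &buckets[code];
            if (b->len == b->cap) {
                b->cap = b->cap ? b->cap * 2 : 4;
                b->pos = realloc(b->pos, b->cap * sizeof(*b->pos));
            }
            b->pos[b->len++] = i;       /* record every position of this k-mer */
        }
    }

    /* Exact search: look up the pattern's first k-mer, then verify candidates. */
    static void query(const char *seq, const char *pattern) {
        size_t n = strlen(seq), plen = strlen(pattern);
        if (plen < K) return;
        long code = kmer_code(pattern);
        if (code < 0) return;
        bucket_t *b = &buckets[code];
        for (size_t i = 0; i < b->len; i++) {
            size_t at = b->pos[i];
            if (at + plen <= n && memcmp(seq + at, pattern, plen) == 0)
                printf("match at %zu\n", at);
        }
    }

    int main(void) {
        const char *seq = "ACGTACGTACGGTTACGTACGTTT";
        index_sequence(seq);
        query(seq, "ACGTACGTAC");       /* prints: match at 0 */
        return 0;
    }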
I built ag for myself; both as a tool and to improve my skills profiling, benchmarking, and optimizing. Had I known how popular it would become, I would have definitely held myself to a higher standard, or any standard. Most importantly, I'd have written tests. These days, I'm busy with a startup so progress on those fronts has been slow.
ag is incredible, especially paired with Ack.vim and a mapping. I use <leader>as to search for the current word under the cursor. The results are instantaneous. With ag and YouCompleteMe, I never fall back to cscope/ctags in C++ projects anymore.
One thing, though: it skips certain source files seemingly arbitrarily without the -t param, and I haven't figured out why... It doesn't seem related to any .gitignore entries I've been able to identify.
The silver searcher is pretty good, but it has a couple of big problems. It does not parse the .gitignore correctly [0], so it frequently searches files that are not committed to your repo. This, combined with the decision to print 10,000-character-long lines, means a lot of search results are useless.
I noticed the issue you mentioned, but as the last comment on it notes, I believe this has already been fixed. My specific case, at least, was resolved by updating from master.
One thing I miss a little is that ack has the super convenient:
ack --java "foo"
while with ag you write:
ag -G"\.java$" "foo"
But yes, ack and ag feel pretty identical except for the speed. Most of the time the speed improvement is irrelevant to me, except sometimes now I'll use ag in my home folder, and it's still fairly snappy.
That was too much typing anyway. When you mostly work with one language, something like this is nice (in my case C/C++):
alias ack-cpp='ack-grep --type=cpp --type=cc'
Hm, I've recently begun using zsh primarily and this trick doesn't work there: zsh lets you know what the alias is... bash will happily find `rack` in your `$PATH` and then run it.
(Presumably because in zsh, `which which` says it's a shell built-in, whereas in bash it finds `/usr/bin/which`, so bash doesn't seem to care about your aliases.)
I normally tell people to use ack because it's like grep but faster (owing to its sensible defaults) ... if I use this I'm worried I might go too fast and travel backwards in time or something.
In my benchmarking, mmap() was about 20% faster than read() on OS X, but the same speed on Ubuntu. Pretty much everything else in the list (pthreads, JIT regex compiler, Boyer-Moore-Horspool strstr(), etc) improves performance more than mmap().
Also, mmap() has the disadvantage that it can segfault your process if something else makes the underlying file smaller. In fact, there have been kernel bugs related to separate processes mmapping and truncating the same file.[1] I mostly use mmap() because my primary computer is a Mac.
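For the curious, the JIT regex compiler bit is PCRE's pcre_study() with PCRE_STUDY_JIT_COMPILE. A simplified sketch, not the exact code in ag — and note that pcre_exec()'s length parameter really is an int, which is where the 2GB limit mentioned above comes from:

    /* Sketch of the PCRE1 JIT path; compile with -lpcre. */
    #include <pcre.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *pattern = "wor(l)d";
        const char *subject = "hello world";
        const char *err;
        int erroffset;

        pcre *re = pcre_compile(pattern, 0, &err, &erroffset, NULL);
        if (!re) {
            fprintf(stderr, "compile error at %d: %s\n", erroffset, err);
            return 1;
        }

        /* pcre_study() with PCRE_STUDY_JIT_COMPILE is what turns on the JIT. */
        pcre_extra *extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &err);

        int ovector[30];
        int rc = pcre_exec(re, extra, subject, (int)strlen(subject),
                           0, 0, ovector, 30);
        if (rc >= 0)
            printf("match at byte %d\n", ovector[0]);

        if (extra) pcre_free_study(extra);
        pcre_free(re);
        return 0;
    }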
Now I'm burning with curiosity. I have to know why! My plan:
- replicate the experiment, confirm --mmap shaves off a non-negligible amount of time. It could be that his computer happened to be running something in the background that was using his hard drive, for example, which would skew the results.
- look at the code, figure out the exact difference between what --mmap is doing and what it does by default. Confirm that the problem isn't in grep itself (it's probably not, but it's important to check).
- dig into the kernel source to figure out the difference under the hood and why it might be faster.
I wonder if it has to do with not having to copy data back and forth between kernel and userspace. My mildly uneducated thought is that you could do this with splice() or whatever, but mmap is an easy drop-in replacement.
edit: I've been reading your posts for a while and I like them, but I keep wondering, why do you have sillysaurus1-2-3?
That's what has me so curious, because it doesn't seem like copying between kernel/userspace should account for a 20% speed drop. Once data is in the L3 CPU cache, it should be inexpensive to move it around.
Regarding my ancestry, I'm sillysaurus3 because I've (rightfully) been in trouble twice with the mods for getting too personal on HN. I apologized and changed my behavior accordingly, and additionally created a new account both times to serve as a constant reminder to be objective and emotionless. There's rarely a reason to argue with a person rather than with an idea. Debating ideas, not people, has a bunch of nice benefits: it's easier to learn from your mistakes, it makes for better reading, etc. It's pretty important, because forgetting that principle leads to exchanges like https://news.ycombinator.com/item?id=7700145
Another nice benefit of creating a new account is that you lose your downvoting privilege for a time, which made me more thoughtful about whether a downvote is actually justified.
Possibly the OS is doing interesting things with file access and caching, and opting out of that has benefits for this particular workload?
...
I just skimmed the BSD mailing-list email on why grep is fast that was linked up-thread, and it seems that's somewhat the case. It sounds like, since they do advanced search techniques on what matches or can match, they use mmap to avoid requiring the kernel to copy every byte into memory when they know they only need to look at specific ranges of bytes in some instances. At least that was the case at some point in the past.
Finally, when I was last the maintainer of GNU grep (15+ years ago...), GNU grep also tried very hard to set things up so that the _kernel_ could ALSO avoid handling every byte of the input, by using mmap() instead of read() for file input. At the time, using read() caused most Unix versions to do extra copying.
P.S. Nice attitude, it earned an upvote from me. Which is probably one reason why your third account has more karma than my first.
Right, I think the point of Boyer-Moore is that it lets you skip large chunks of the text during the search.
So the assumption is that those pages never even get paged in, but I think that would only be the case when the pattern size is at least as large as the page size (usually 4KB!), which isn't the case in the example in the mailing list. So the mystery continues!
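To make the skipping concrete, a bare-bones Boyer-Moore-Horspool looks like this (illustration only, not grep's or ag's implementation). The jumps are bounded by the pattern length, which is tiny next to a 4KB page, so every mapped page still gets faulted in even though most bytes are never compared:

    /* Bare-bones Boyer-Moore-Horspool: on a mismatch, the bad-character
     * table lets the search jump forward by up to the pattern length. */
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    static const char *bmh_search(const char *hay, size_t hlen,
                                  const char *needle, size_t nlen) {
        if (nlen == 0 || hlen < nlen) return NULL;

        size_t skip[256];
        for (size_t i = 0; i < 256; i++) skip[i] = nlen;   /* default: full jump */
        for (size_t i = 0; i + 1 < nlen; i++)
            skip[(unsigned char)needle[i]] = nlen - 1 - i;

        size_t pos = 0;
        while (pos + nlen <= hlen) {
            if (memcmp(hay + pos, needle, nlen) == 0)
                return hay + pos;
            /* Jump based on the haystack byte aligned with the pattern's last char. */
            pos += skip[(unsigned char)hay[pos + nlen - 1]];
        }
        return NULL;
    }

    int main(void) {
        const char *hay = "the quick brown fox jumps over the lazy dog";
        const char *hit = bmh_search(hay, strlen(hay), "lazy", 4);
        printf("%s\n", hit ? hit : "no match");   /* prints "lazy dog" */
        return 0;
    }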
The last time I had to do fast, large sequential disk reads on Linux, it was surprisingly complex to get all the buffering/caching/locking to not do the wrong thing and slow me down a lot. I wouldn't be surprised if non-optimized mmap() is a whole lot faster than non-optimized use of high-level file I/O libraries.
If anything, that post is evidence of how tricky optimization is, and how easy it is to fool yourself about what matters. It's probably best to be skeptical about mmap() as a performance optimization over reading into a buffer unless evidence demonstrates otherwise. Most OSes do a pretty good job of caching at the filesystem level, and under the hood paging is essentially reading into a buffer anyway. mmap() might make the code simpler, but it's hard to imagine it makes it faster. If it does, I'd like to understand why.
So are we talking about constant-time optimization, then? I.e. it shaves off a few milliseconds regardless of how complex the search is, or how many files it's reading, or how large each file is. I'll happily concede that mmap() might do that. But a performance boost linear w.r.t. search complexity/number of files/filesize? Hard to believe, and I should go measure it myself to prove the point or learn why I'm mistaken.
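The measurement itself can be tiny: time a read() loop against an mmap() scan over the same file, with newline counting standing in for the real work. A sketch — run it in both orders a few times, since whichever pass goes first warms the page cache and skews the comparison:

    /* Minimal read() vs mmap() timing harness; usage: ./a.out <file> */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    static size_t count_read(int fd) {
        static char buf[1 << 20];          /* 1 MiB read buffer */
        size_t lines = 0;
        ssize_t n;
        lseek(fd, 0, SEEK_SET);
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            for (ssize_t i = 0; i < n; i++)
                lines += (buf[i] == '\n');
        return lines;
    }

    static size_t count_mmap(int fd, size_t size) {
        char *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 0;
        size_t lines = 0;
        for (size_t i = 0; i < size; i++)
            lines += (p[i] == '\n');
        munmap(p, size);
        return lines;
    }

    int main(int argc, char **argv) {
        if (argc != 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat sb;
        if (fd < 0 || fstat(fd, &sb) < 0 || sb.st_size == 0) return 1;

        double t0 = now();
        size_t a = count_read(fd);
        double t1 = now();
        size_t b = count_mmap(fd, sb.st_size);
        double t2 = now();

        printf("read(): %zu lines in %.3fs, mmap(): %zu lines in %.3fs\n",
               a, t1 - t0, b, t2 - t1);
        close(fd);
        return 0;
    }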
Constant-time improvements are still improvements, especially if they're in an inner loop. Otherwise we would all be using Python and just writing great algorithms.
I'm assuming (and this could be a bad assumption) that, based on the continuing-updates architecture, if client A changes model A, client B will see an update on model A. How does client B get notified of the change? Does it have some kind of fallback system a la socket.io? Is this not yet part of the project?
Judging by the chart on the site, it looks like each client polls the original model on the server, and infers changes by diffing it against the client's current copy.
While OTs (there are several variants) are a promising approach for distributed authoring, I think the complexity of implementing them is still prohibitive. Surely there is a better way...
I have been reading papers on this, looking for a "clean" way to solve it (in the context of packet loss, latency, etc.).
There are other approaches out there. One example worth checking out, which uses character-based changes, is here:
PAPER wikisym.org/ws2010/tiki-download_wiki_attachment.php?attId=15
CODE https://github.com/gritzko/ctre
I feel like it is time someone solved this collaborative-editing thing once and for all and shared the code with everyone. (Firepad? https://github.com/firebase/firepad/)
I wrote a simulator to help me understand Differential Sync. The nice part about DS, unlike OT, is that you can still work offline and sync later. Here is the simulator:
Yup, I've actually read all those papers, and then some. I think OT is often misunderstood and poorly documented, and really needs a solid go-to library. (Along with clear, accessible documentation.)
ShareJS and "Operational-Transformation" (used by Firepad) are decent, but in my opinion aren't general enough and may even have some algorithmic flaws.
And yes, it's time this was solved, so the basics don't have to be re-implemented for every collaborative project.
http://sharejs.org/ is in use for a number of production sites, and we've had users report that it is quite reliable. How would you like to see it further generalized?
OT works naturally with linear data structures, but is tougher to use correctly with more complex data, which Mozilla seems to have here with Towtruck.
No approach to technology architecture is a "silver bullet," especially with a stack spread across client and server.
I always find myself returning to the "boring, old-fashioned" way of doing things (like server-side processing or relational databases) and these do indeed seem like the best choices for many applications.
But I don't see how this calls client-side MVC into question, except that perhaps it's considered as a default choice too often.
While looking beyond a pg_dump-style approach to backup/recovery, I was considering https://github.com/heroku/WAL-E and discovered Barman. It's also open source, and looks like a strong contender.