Your comment was useful, so not totally in vain :-)
Just incited me to have a little look at PowerShell (read through [0], a useful intro). It looks nice; I can definitely see the utility in having a simple object model for transferring information between processes. In *nix land you get pretty good at extracting data from simple text forms, though sometimes it's harder than it should be.
One thing that jumped out at me there is the overhead of the commands.
430ms: ls | where {$_.Name -like "*.exe"}
140ms: ls *.exe
27ms: ls -Filter "*.exe"
Not so much the absolute numbers but the fact that there are three different ways of doing it and the most flexible choice is over an order of magnitude slower.
What happens when you add another command to the pipeline? Do they buffer the streams like on Linux?
I guess the situation will improve over time, but how complete is the ecosystem at the moment? One area where *nix will always shine is total ubiquity: everything can be done with commands and everything works with text.
You found three ways of doing things that each filter at a different level.

The -Filter parameter filters on the provider side¹ of PowerShell, i.e. in the part that queries the file system directly. Essentially your filter is probably passed straight to FindFirstFile/FindNextFile, which means fewer results have to travel up through the layers in the first place. The -Path parameter (implied in ls *.exe, since it's the first positional parameter) works within the cmdlet itself, which is also evident in that it supports a slightly different wildcard syntax (you can use character classes here, but not in -Filter), because matching already happens over in PowerShell land. The slowest option pushes filtering yet another layer up, into the pipeline: you get two cmdlets passing values to each other, and that's some overhead as well, of course.

Note that the most flexible option here combines Get-ChildItem with Where-Object, whereas the direct equivalent in Unix would probably be find, which replicates some of ls' functionality² to do what it does and which probably places it in a very similar spot to ls, performance-wise.
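To make the wildcard difference concrete, here's a quick sketch of my own (assuming a directory that happens to contain a.exe and b.exe, nothing article-specific):

  ls [ab]*.exe             # -Path wildcard is matched by PowerShell, so the character class works
  ls -Filter '[ab]*.exe'   # -Filter goes to the file system provider, which treats [ and ] literally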
It's not uncommon for the most flexible option to be the slowest, though. In my own tests the results were 18 ms, 115 ms and 140 ms for those three commands in $Env:Windir\system32, so the difference wasn't as big as in your case. For a quick command on the command line I feel performance is adequate either way, unless you're dealing with very large directories. If you handle a large volume of data, regardless of whether it's files, lines, or other objects, you probably want to filter as much as you can, as close to the source as you can – generally speaking.
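If you want to compare on your own machine, Measure-Command is handy for this sort of thing (rough sketch; the numbers will obviously vary, and the first run is skewed by file system caching):

  cd $Env:Windir\system32
  Measure-Command { ls -Filter *.exe }                      # provider-side filtering
  Measure-Command { ls *.exe }                              # cmdlet-side wildcard
  Measure-Command { ls | where { $_.Name -like '*.exe' } }  # filtering in the pipeline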
As for buffering ... I'm not aware of any, unless the cmdlet needs the complete output of the previous one to do its work. By default every result from a pipeline is passed individually from one cmdlet to the next. Some cmdlets do buffer, though, e.g. Get-Content has a -ReadCount parameter that controls buffering in the cmdlet (man gc -param readcount). Sort-Object and Group-Object are the most common ones (for me at least) that always need the complete output of the previous stage before returning anything, for quite obvious reasons.
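A small toy example of my own to see the streaming for yourself: the first pipeline prints a line roughly every second as items trickle through, while inserting Sort-Object makes everything appear only at the end, because it has to collect all of its input first.

  # items flow through one at a time, so output appears once per second:
  1..3 | foreach { sleep 1; $_ } | foreach { "got $_ at $(Get-Date -Format T)" }
  # Sort-Object has to see everything first, so nothing appears for ~3 seconds:
  1..3 | foreach { sleep 1; $_ } | Sort-Object | foreach { "got $_ at $(Get-Date -Format T)" }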
However, even though I did some work on Pash, the open-source reimplementation of PowerShell, I'm not terribly well-versed in its internal workings, so take the buffering part with a grain of salt.
As for completeness, well, the Unix ecosystem has an enormous edge here, simply by having been around for decades and amassing tools and utilities. Since PowerShell was intended for system administrators you can expect nearly everything needed there to have PowerShell-native support. This includes files, processes, services, event logs, Active Directory, and various other things I know little to nothing about. Get-Command -Verb Get gives you a list of things that are supported directly that way. It seems even configuration areas like networking and disks are covered by now. At Microsoft there's a rule, I think, that every new configuration GUI in Windows Server has to be built on PowerShell. Which means everything you can do in the GUI, you can do in PowerShell, and I think in some cases you can even grab the script for the changes you just made in the GUI – e.g. for applying the same change to a few hundred machines at once, or whatever.
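If you want to poke around, something like this gives a rough picture (just off the top of my head, not exhaustive):

  Get-Command -Verb Get | Measure-Object                   # rough count of Get-* commands available
  Get-Command -Noun Service, Process, EventLog             # cmdlets for a few of the areas mentioned above
  Get-Service | Where-Object { $_.Status -eq 'Running' }   # and they all return filterable objects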
Of course, you can just work with any collection of .NET objects by virtue of the common cmdlets working with objects (gcm -noun object). For me, whenever there is no native support, .NET is often a good escape hatch that in many cases isn't terribly inconvenient to use. You also have more control over what exactly happens at that level, because you're one abstraction level lower. As a last resort, it's still a shell: it can run any program and get its output. Output from native programs comes back as string[], line by line, and in many cases that's no worse than with cmd or any Unix shell.
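A couple of quick illustrations of both escape hatches (my own sketch, assuming a Windows box where ipconfig exists):

  # calling into .NET directly when there's no cmdlet for something:
  [System.IO.Path]::GetTempPath()
  [System.IO.Directory]::GetFiles("$PWD", '*.exe')
  # native programs still work; each output line becomes one string element:
  $out = ipconfig
  $out[0].GetType().Name       # String
  $out | Select-String 'IPv4'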
_____
¹ Keep in mind, the file system is just one provider and there are others, e.g. the registry, certificate store, functions, variables, aliases, and environment variables, which all work with exactly the same commands (a few examples below). That's why ls is an alias for Get-ChildItem and there is no Get-File: those commands are agnostic of the underlying provider.
² So much for "do one thing" – but understandable, because ls' output is not rich enough to filter for certain things further down in the pipeline.
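A few quick illustrations of the provider point from ¹ (again, just a sketch):

  ls Env:                        # environment variables
  ls Function:                   # functions defined in the current session
  ls HKLM:\SOFTWARE\Microsoft    # registry
  ls Cert:\CurrentUser\My        # certificate store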
Oh yeah, I can see how the different filtering spots would make a difference.
Was just a bit surprised at the overhead of adding a command to the pipeline. A similar setup on Linux would be something like the following, I guess (on a folder with 4600 files, 170 of them matching).
20ms: time ls *.pdf
35ms: time find -maxdepth 1 -iname '*.pdf'
60ms: time ls | egrep '.pdf$'
I was more wondering whether each additional command added that much overhead. Your numbers look much more reasonable. Maybe the article I read had a big jump due to filesystem/caching effects or something.
[0] https://developer.rackspace.com/blog/powershell-101-from-a-l...