
Looks interesting for some tables, but I'm not sold yet. It seems to (mostly) show analysis of only a single dimension at a time, though it looks like there is some scrubbing capability?

For multiple dimensions (and I would consider a table with N columns akin to a list of N-dimensional points), GGobi has a number of tools for showing the relations and correlations. A brief demo of a few of the features is here: https://vimeo.com/12292239 Parallel coordinates are not shown in that video (they are in others), but they look like something that could easily fit in the top part of your table.

I know the names Ramana Rao and Stuart K. Card but hadn't seen that paper before; I'll have to look closer. Dianne Cook (co-creator of xgobi and ggobi) also had a paper in IEEE Vis that same year.


Programmers of the world, you are at the forefront of job creation!


Very interesting. So Zatocoding [1] allows a card to have multiple index entries, any of which can then be used to retrieve it.

For completeness, the 1951 paper is here [1]. Apparently it has been covered in undergraduate algorithms at U. of I. [2] (along with Bloom filters), so it may be getting more exposure.

Is the "needed improvement" of the 1976 paper due to better methods available (e.g. proof rigor, or understanding of sorting algorithms) or a better explanation of the methods (perhaps because of more widespread knowledge of information theory, and a better defined terminology).

I thought edge-notched cards [3] had been used for a long time (and they have: since 1896)--I remember reading about them being used for fingerprint card retrieval (indexed by feature, on each finger). That didn't use superimposed coding, but instead was a type of content-addressable memory--the time to retrieve all cards with e.g. a whorl on the thumb is O(1).

Apparently the cards are still used some places; see Kevin Kelly's site [4] for some images and interesting comments.

Finally found a fingerprint filing reference: "For example, as early as 1934 the FBI tried a punchcard and sorting system for searching fingerprints, but the technology at that time could not handle the large number of records in the Ident files." [5]

    1. https://courses.engr.illinois.edu/cs473/fa2013/misc/zatocoding.pdf
       (or buy from Wiley for $38)
    2. https://courses.engr.illinois.edu/cs473/fa2013/lectures.html
    3. http://en.wikipedia.org/wiki/Edge-notched_card
    4. http://kk.org/thetechnium/one-dead-media/
    5. Chapter 3: Evolution to Computerized Criminal History Records
       https://www.princeton.edu/~ota/disk3/1982/8203/820306.PDF


Mooers' original work was presented at the ACS meeting in September 1947. Knuth cites that, as well as the Am. Doc. (1951) citation you gave, in TAOCP v3 p571. His 1948 Master's thesis from MIT, which goes through the derivation, is at http://dspace.mit.edu/handle/1721.1/12664 . This is why I say it's from the 1940s, not the 1950s. I also think the chapter "Mathematical Analysis of Coding Systems" by Carl Wise, from "Punched cards; their applications to science and industry" (2nd ed. 1958), at http://babel.hathitrust.org/cgi/pt?id=uc1.b3958636;view=1up;... gives an excellent treatment of the topic.

Nice find with the U. of I. course. Interestingly, if you listen to the presentation, the speaker says the topic won't be on the test (perhaps the rigor behind the method isn't as good as desired for the class), but it's something the students ought to know exists. Plus the speaker "just found out about" the topic (at 08:19), and, like you, cites a 1950s date and describes it as using a fixed number of bits per category... which is why he says that Zatocoding "reappears" years later as a Bloom filter. They are both superimposed codes, but not the same thing.

I find the mention of the two problems with superimposed codes interesting: 1) researchers don't like false drops and instead expect perfect matches (in Mooers' chapter in 'Punched Cards' he spins them as serendipitous matches; Knuth does something similar in TAOCP with 'false drop cookies'), and 2) librarians love to make hierarchical categorizations, which aren't needed with Zatocoding.

(BTW, from what I read, American Documentation was the journal for information theory in the 1950s and covered many topics. I interpret Mooers' paper more as advertisement to a wider audience, because it doesn't go into the technical details of his method. He was trying to drum up work for his consulting business.)

The "needed improvement" comes with chemical descriptors. Suppose one of your descriptors is "contains a carbon", another is "contains 3 carbons in a row" and a third is "contain 6 carbons in a row". I'll write this as 1C, 3C, and 6C. Zatocoding treats all descriptors as independent, though there's a paper where he describes a correction for when there are correlations. But in this case whenever 6C exists then both 3C and 1C also exist. These are hierarchical descriptors.

I was wrong about the date. The improvement paper is Feldman and Hodes, JCICS 1975 15 (3) pp 147-152, not 1976. In the original Zatocoding, the number of bits $k$ is given as -log2(descriptor frequency). In Feldman and Hodes, $k$ for the root descriptors are given the same way, but $k$ for a child descriptor is given as log2(parent frequency/child frequency). It's possible for a fragment to have multiple parents (consider that "CC" and "CO" are both parents of "CCO"). In that case, use the least frequent parent.

In addition, ensure that the bits selected for the child do not overlap with the bits set for the parent.

This ends up giving a first-order correction to Zatocoding for hierarchical data.
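
If it helps, here's a rough Python sketch of how I read the bit-allocation rule. The 1C/3C/6C frequencies below are made up for illustration, and the max(..., 1) floor is my own guard, not something from either paper:

  import math

  def zatocoding_bits(freq):
      # classic Zatocoding: k = -log2(descriptor frequency)
      # (the max(..., 1) floor is my addition, not from the paper)
      return max(1, round(-math.log2(freq)))

  def feldman_hodes_bits(freq, parent_freqs=()):
      # root descriptors: same as classic Zatocoding
      # child descriptors: k = log2(parent frequency / child frequency),
      # using the least frequent parent when there are several
      if not parent_freqs:
          return zatocoding_bits(freq)
      least_frequent_parent = min(parent_freqs)
      return max(1, round(math.log2(least_frequent_parent / freq)))

  freqs = {"1C": 0.95, "3C": 0.40, "6C": 0.05}            # invented numbers
  print(feldman_hodes_bits(freqs["1C"]))                   # root -> 1 bit
  print(feldman_hodes_bits(freqs["3C"], [freqs["1C"]]))    # parent 1C -> ~1 bit
  print(feldman_hodes_bits(freqs["6C"], [freqs["3C"]]))    # parent 3C -> 3 bits

(This only computes how many bits each descriptor gets; actually picking bit positions so a child doesn't overlap its parent, as mentioned above, is a separate step.)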

The "needed improvement" therefore is the ability to handle hierarchical descriptors, which are frequently found in chemical substructure screens.

While you say "still used some places", Kelly's site concerns dead media. I spent some time trying to find modern edge-notched cards, including, through a friend, asking on a mailing list of people interested in old computing tech. No success. There were a couple of used ones on eBay, but I wanted 500 unused ones so I could make a data set myself. Instead, this being the 21st century, I think I can use a paper die-cut machine to make them for me, with pre-cut holes even.

You mention "sorting" several times. My use is for selection, not sorting.

For punched cards the selection time will be O(N). The only way to get O(1) is with an inverted index of just the descriptor you're looking for. Mooers in 1951 proposed 'Doken' (see http://www.historyofinformation.com/expanded.php?id=4243) as a way to search '100 million items in about 2 minutes'.
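
In software terms an inverted index is just a map from each descriptor to the set of item ids that carry it, so selecting on a descriptor is a single lookup instead of a scan over every card. A minimal sketch (card ids and descriptors are made up):

  from collections import defaultdict

  # map each descriptor to the set of card ids that carry it
  index = defaultdict(set)

  def add_card(card_id, descriptors):
      for d in descriptors:
          index[d].add(card_id)

  add_card(1, ["1C", "3C"])
  add_card(2, ["1C", "3C", "6C"])
  add_card(3, ["1C"])

  print(index["3C"])                  # {1, 2} -- one lookup, no scan
  print(index["3C"] & index["6C"])    # {2} -- intersect for multiple descriptors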


There are tools and data available at the Science of Science (Sci^2) site [1], part of the Cyberinfrastructure for Network Science Center, funded by NSF.

I haven't looked closely in a couple of years, but my impression is that NSF hopes such tools can help show the effects (and "ROI") of research grant money, its connection with PIs and institutions, and its impact on both publication citations and the economy (development of technology, etc.).

Ideally this could also be used to measure growth in technical fields, to determine whether more (or less) funding is required to answer bigger questions in basic science (which may not have economic incentives yet), methodologies, public policy, and education (will there be enough Ph.D.s in the pipeline to meet demand in fields that will exist in ten years?).

Scientometrics [2] (the journal) has been around for nearly 40 years, and I assume people were thinking about such issues then. Sci^2 looks to me like a more "big data" approach to not only understanding this, but seeing if it is possible to "push" the frontiers (but I admit I don't know anything that goes on at NSF or how their decision-making process works).

Another tool, Publish Or Perish [3], is aimed at individual academics to understand their (or another's) impact in terms of citation metrics that are used in the games for academic (and other) hiring purposes.

I stumbled on Sci^2 when trying to learn some new fields (computer vision, hpc / parallel computing, network science, sensemaking) and wanted to quickly find seminal papers (i.e. highly cited, or literature reviews) to get a broad overview. Not having the patience or time to read a lot, playing with interesting tools and trying to extract data from Google Scholar and the like was more attractive.

Being impatient, I wanted a way to process knowledge like data. To measure something like growth of a field, it seems something like scientometrics with some natural language processing and ontology engineering is needed.

The Google paper seems to be more of an analysis of Google Scholar data and what can be gleaned there. Maybe an update of Google Scholar Metrics is coming? I am surprised there is no reference to scientometrics in the arXiv paper (maybe they aren't familiar with the literature?).

    1. https://sci2.cns.iu.edu/user/index.php
    2. http://en.wikipedia.org/wiki/Scientometrics_%28journal%29
    3. http://www.harzing.com/pop.htm


Go to https://www.meteor.com/ and Start tutorial. After it's installed you can go through the tutorial in an hour or two, depending on how deep you go, and it comes with a number of examples (shown with "meteor create --list").

Discover Meteor has some more; they were giving their book away (see https://www.discovermeteor.com/blog/we-made-our-book-free/ -- first 4 chapters are free).

That may be enough for the basics. Then look at Telescope, one of the chat apps listed here, or search on GitHub. There are also plenty of conference talks on YouTube.


Haven't tried, but I think I would find this intrusive and annoying--I'd be mentally keeping track of how many tabs I had open and then checking my bookmarks if I went over. Not for me, but maybe useful for some.

I've tried session managers too, and those are okay for restoring state when things crash.

THE GREATEST THING is Tabs Outliner (chrome only). This displays all your windows and tabs as a tree and lets you rearrange, easily close and restore windows, selectively restore, add notes to tabs, etc.

A slight learning curve, but worth it for the power surfer. Demo of some of the features in this video: https://www.youtube.com/watch?v=OqjcrfKjobY


(this comment has nothing to do with this distro, just the general state of software installation and the transmission of such instructions on the internet.)

I'm not sure anybody else gets you, but I'm totally with you on changing the copy-paste way of transmitting instructions. And I don't mean a GUI and a whole lot of clicking.

What I'd like to see: system requirements (hw/sw, pre-reqs) and applicability tests (will this work for me?); instructions tailored to my configuration(s), automatically; recommended and trusted instructions (using PKI), with data and statistics to back up the confidence I should have; time estimates and an environment-impact statement; testing and follow-up discussion; and transactional rollback of any step to the previous state of the system.

See instructions, install instructions: it should be as easy as a click. If something breaks, undo should be just as easy.

Copying and pasting instructions is like typing in BASIC listings from magazines decades ago. "The clipboard is the new fax machine."

At the other extreme are "apps" which are nice when they don't break, but opaque and not developer-friendly.

I'm willing to discuss this elsewhere, or to learn about existing efforts that try to address these problems.


I would like to see a study done on how much ad-blocking can save in terms of energy costs (not running flash ads), bandwidth, and reduction of risk from malware.


You don't have to wait. You can turn that off with '-n'.

When you jump to the end of the file (the '>' command) you'll see "Calculating line numbers... (interrupt to abort)", and an interrupt (ctrl-C) there will also turn them off.


I didn't know about the -n switch. Usually I need to get to the bottom of the file and find the last occurrence of a string in the logs (G, then Ctrl-C to stop counting lines, then ?string to search), and the slower part is waiting for less to match the string going backwards through the file. I'll check whether the -n switch solves the problem.


Use '/' (search) to enter a regexp to match a pattern on that line, and it will be highlighted.


This has nothing to do with what is being discussed in this subthread.


Sure it does. If your log lines are distinct (e.g. they have timestamps or unique IDs) then you can use less's search highlighting to provide a visual marker for a specific line, similar to what you can do by manually inserting a bunch of blank lines on the console.

This trick doesn't work if your log file has a bunch of identical lines and you want to keep an eye on their rate, though.


Having to remember & type a timestamp has much more mental overhead (planning & memory) than "scan/scroll back to last block of vertical whitespace".

That's why suggestions of either named-marks or back-searches aren't considered equally-attractive alternatives to marking the scrollback with a batch of <return>s.


Imagine the scenario: "I want to see everything that happens in a single request."

Enter method:

  1. Press enter a bunch of times
  2. Reload browser
  3. Press enter a bunch of times and scroll up
Your method:

  1. Search for last line in output to highlight it?
  2. Reload page
  3. Try and figure out where stuff starts and ends with loads of visual noise


I won't argue that for this specific use case, tail isn't friendlier than less.

But the original poster posted a useful tip, and is now getting aggressive downvotes and comments like "This has nothing to do with what is being discussed in this subthread." I think that's unwarranted.


less method:

  1. ma
  2. Reload browser
  3. 'a
less method with two marks:

  1. ma
  2. Reload browser
  3. mb
  4. 'a


Unfortunately, marks don't seem to work while you're following.


A bunch of blank lines is a lot more grokable than timestamps lost in a bunch of text.

