So storytime! I worked at Twitter as a contractor in 2008 (my job was to make internal hockey-stick graphs of usage to impress investors) during the Fail Whale era. The site would go down pretty much daily, and every time the ops team brought it back up, Twitter's VCs would send over a few bottles of really fancy imported Belgian beer (the kind with elaborate wire bottle caps that tell you it's expensive).
I would intercept these rewards and put them in my backpack for the bus ride home, in order to avoid creating perverse incentives for the operations team. But did anyone call me 'hero'?
Also at that time, I remember asking the head DB guy about a specific metric, and he ran a live query against the database in front of me. It took a while to return, so he used the time to explain how, in an ordinary setup, the query would have locked all the tables and brought down the entire site, but he was using special SQL-fu to make it run transparently.
We got so engrossed in the details of this topic that half an hour passed before we noticed that everyone had stopped working and was running around in a frenzy. Someone finally ran over and asked him if he was doing a query, he hit Control-C, and Twitter came back up.
> Could somebody explain why so much effort is being put into quant strategies, when it seems that real-world information gathering would be a much easier way to gain an edge over others?
I used to be part of a research group that sold the so-called "alternative data" you're describing to 30 or so hedge funds in the NYC area, including several of the largest. The example I like to give is that we knew well ahead of time that Tesla would miss on the Model 3 because we knew every vehicle they were selling by model, year, configuration, date and price with >99% accuracy. I still occasionally sell forecasts like this, and the methodology is straightforward enough that even a solo investor can consistently beat the market if they know how to source the data. But I've mostly lost faith in this technique as the sole differentiator of a fund's alpha.
Some funds, like Two Sigma, have large divisions with a very sophisticated pipeline for this kind of analysis. They do exactly what you describe. For the most part it works, but there are several obstacles that keep this from being the holy grail of successful trading:
1. First and foremost, this analysis is fundamentally incomplete. You are not forecasting market movements, you're forecasting singular features of market movements. What I mean by that is that you aren't predicting the future state of a price; if the price of a security is a vector representing many dimensions of inputs, you're predicting one dimension. As a simple example, if I know precisely how many vehicles Tesla has sold, I don't know how the market will react to this information, which means I have some nontrivial amount of error to account for.
2. This analysis doesn't generalize well. If I have a bunch of information about the number of cars in Walmart parking lots, the number of vehicles sold by Tesla (with configurations), the number of online orders sold by Chipotle, etc. how should I design a data ingestion and processing pipeline to deal with all of this in a unified way? In other words, my analysis is dependent upon the kind of data I'm looking at, and I'll be doing a lot of different munging to get what I need. Each new hypothesis will require a lot of manual effort. This is fundamentally antagonistic to classification, automation and risk management.
3. It's slow. Under this paradigm you're coming up with hypotheses and seeking out unique and exclusive data to test those hypotheses. That means you're missing a lot of unknown unknowns and increasing the likelihood of finding things that other funds will also be able to find pretty easily. You are only likely to develop strategies which can have somewhat straightforward and intuitive explanations for their relationship with the data.
This is not to say the system doesn't work - it very clearly works. But it's also easy to hit relatively low capacity constraints, and it's imperfect for the reasons I've outlined. You might think exclusive data gives you an edge, but for the most part it does not (except for relatively short horizons). It's actually extremely difficult to have data which no other market participant has, and information diffusion happens very quickly. Ironically, in one of the very few times my colleagues and I had truly exclusive data (Tesla), the market did not react in a way that could be predicted by our analysis.
The most successful quantitative hedge funds focus on the math, because most data has a relatively short half-life for secrecy. They don't rely on the exclusivity of the data, they rely on superior methods for efficiently classifying and processing truly staggering amounts of it. They hire people who are extraordinarily talented at the fundamentals of mathematics and computer science because they mostly don't need or want people to come up with unique hypotheses for new trading strategies. They look to hire people who can scale up their research infrastructure even more, so that hypothesis testing and generation is automated almost entirely.
This is why I've said before that the easiest way to be hired by RenTech, DE Shaw, etc. is to be on the verge of re-discovering and publishing one of their trade secrets. People like Simons never really cared about how unique or informative any particular dataset is. They cared about how many diverse sets of data they could get and how efficiently they could find useful correlations between them. The more seemingly disconnected and inexplicable, the better.
Now with all of that said, I would still wholeheartedly recommend this paradigm for anyone with technical ability who wants to beat the market on $10 million or less (as a solo investor). A single creative and competent software engineer can reproduce much of this strategy for equities with only one or two revenue streams. You can pile into earnings positions for which your forecast predicts an outcome significantly at odds with the analyst consensus. You can also use your data to forecast volatility on a per-equity basis and sell options on those which do not indicate much volatility in the near term. Both of these are competitive for holding times ranging from days to months and, with the exception of some very real risk management complexity, do not require a large investment in research infrastructure.
Of course it's a bold statement. But if I wasn't bold, I wouldn't have started university at age 13, set three world records for calculating pi (a stunt, I admit), ranked in the top six mathematics undergraduates in North America, received a $100k+ scholarship to Oxford University (not the Rhodes, unfortunately -- their mistake), received a doctorate in computer science from said university, and become the security officer for the FreeBSD operating system.
There is a very fine line between authorized data, technically public but implicitly unauthorized data, and illegally obtained, unauthorized data. Here’s an example of each in the financial sector, from my personal experience:
1. Financial account aggregators and “budget apps” like Mint monetize their business, in part, by selling huge amounts of data to the financial sector. Sometimes companies like Second Measure take raw data from companies like Yodlee and clean it, then resell it. Nowadays there is an entire industry of alternative market research that has had all sorts of participants, from Foursquare (locations) to Spark (email enhancement). This is technically authorized, because it’s in the TOS. The users effectively contribute their own data.
2. I developed an extremely accurate, reasonably generalizable method of forecasting vehicle production at several companies that relies on implementing a VIN searching algorithm in conjunction with legally required NHTSA recall lookup portals hosted by each manufacturer. This data is what you’d call unauthorized, because no entity explicitly endorses your use of it. For example, several colleagues and I knew well ahead of time that Tesla would miss on production of the Model 3s because they were utterly unrepresented in our data. But this data is public, so it’s fine to use from a legal and compliance standpoint. It was lucrative data specifically because it had a high signal for revenue, yet was hitherto unused and unidentified.
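The enumeration idea above leans on the fact that VIN structure is public. As a minimal sketch (my own illustration, not the actual pipeline; the portal-probing half is omitted, and `with_serial` is a hypothetical helper), here is the standard check-digit rule that any VIN enumerator has to respect before it can probe a recall-lookup portal:

```python
# Sketch of the VIN enumeration idea (not the actual production system).
# A VIN's 9th character is a check digit, so you cannot just increment
# serial numbers blindly -- each candidate VIN must be re-checksummed
# before you probe a recall-lookup portal with it.

TRANSLIT = {**{str(d): d for d in range(10)},
            'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'H': 8,
            'J': 1, 'K': 2, 'L': 3, 'M': 4, 'N': 5, 'P': 7, 'R': 9,
            'S': 2, 'T': 3, 'U': 4, 'V': 5, 'W': 6, 'X': 7, 'Y': 8, 'Z': 9}
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]

def check_digit(vin17: str) -> str:
    """Compute the standard (49 CFR 565) check digit for a 17-char VIN."""
    total = sum(TRANSLIT[c] * w for c, w in zip(vin17.upper(), WEIGHTS))
    r = total % 11
    return 'X' if r == 10 else str(r)

def with_serial(prefix: str, serial: int) -> str:
    """Splice a 6-digit serial into positions 12-17 and fix the check digit."""
    body = f"{prefix[:8]}0{prefix[9:11]}{serial:06d}"  # "0" is a placeholder check digit
    return body[:8] + check_digit(body) + body[9:]
```

From there, sweeping `serial` over a plausible range and recording which candidate VINs a manufacturer's recall portal recognizes gives a rough production count; the rate limits, portal formats, and model-year prefixes are the hard, unglamorous part.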
3. I once found, in the course of looking for legally usable data, an actual security vulnerability disclosing all users of a publicly traded QSR’s online delivery service, along with their phone numbers, email addresses and last four digits of credit cards. This is both unauthorized and illegal, because the data is contaminated with personally identifiable information and it clearly requires a vulnerability (not just scraping) to acquire.
I’ve seen overzealous data vendors accidentally slip from #2 into #3, which is really bad for all concerned. It’s not a great look for the vendor, who will likely be fired, and it represents a breach for the company that owns the data and its users. Any firm that has purchased the data will likely be contaminated and forced into a trading lockdown of that security for a period of time by compliance.
My real concern is that illicit data like this is used in machine learning research. Machine learning is already pretty frustrating - it’s common for me to find research from a conference that I’m simply unable to replicate because the training or experiment data is not available (this is annoyingly the case with A/B experiment optimization research put out by giant companies in particular). I worry that this trend of accepting machine learning research without any requirement for total data transparency will incentivize researchers to conduct their experiments using illicit data that doesn’t need to be sourced.
Been working on accounting systems in RPG and COBOL since ~1992. I also know C/X86ASM/Pascal/Delphi/VB/Fortran. Never bothered with C++ that much; played with Java a bit but Oracle irritates my bowels so moved away from that.
As mentioned in the article, it's good work, but it is also not easy work. You tend to go through cycles of being pushed out, then brought back under extreme emergency, at any cost, to get stuff working -- only for the cycle to repeat. Companies never think of the old guys as the ones to implement the new system; that's a job for the "enterprise experts". I can't even keep track of how many "rewrites" I've seen fail in my life because of this.
We are the dinosaur club, but it's a club that pays extremely well (high six figures a year without working too hard, if you are talented and have a good client base and reputation). But like fossil fuels, one day it will all be gone ;)
I have intimate personal experience with the FCRA. Sadly I don't have an hour to talk about it at the moment, but ping me any time. Short version: it's one of the most absurdly customer-friendly pieces of legislation in the US, assuming you know how to work it. There exist Internet communities where they basically do nothing but assist each other with using the FCRA to get legitimate debts removed from their credit report, which, when combined with the Fair Debt Collection Practices Act, means you can essentially unilaterally absolve yourself of many debts if the party currently owning it is not on the ball for compliance.
The brief version, with the exact search queries you'll want bracketed: you send a [debt validation letter] under the FCRA to the CRAs. This starts a 30-day clock, during which time they have to get to the reporter and receive evidence from the reporter that you actually owe the debt. If that clock expires, the CRAs must remove that tradeline from your report and never reinstate it. Roughly simultaneously with that letter, you send the collection agency a [FDCPA dispute letter], and allege specifically that you have "No recollection of the particulars of the debt" (this stops short of saying "It isn't mine"), request documentation of it, and -- this is the magic part -- remind them that the FDCPA means they have to stop collection activities until they've produced docs for you. Collection activities include responding to inquiries from the CRAs. If the CRA comes back to you with a "We validated the debt with the reporter." prior to you hearing from the reporter directly, you've got documentary evidence of a per-se violation of the FDCPA, which you can use to get the debt discharged and statutory damages (if you sue), or just threaten to do that in return for the reporter agreeing to tell the CRA to delete the tradeline.
No response from the CRA? You watch your mailbox like a hawk for the next 30 days. Odds are, you'll get nothing back from the reporter in that timeframe, because most debt collection agencies are poorly organized and can't find the original documentation for the debt in their files quickly enough. Many simply won't have original documentation -- they just have a CSV file from the original lender listing people and amounts.
If you get nothing back from the reporter in 30 days, game over, you win. The CRA is now legally required to delete the tradeline and never put it back. Sometimes you have to send a few pieces of mail to get this to stick. You will probably follow-up on this with a second letter to the reporter, asserting the FDCPA right to not receive any communication from them which is inconvenient, and you'll tell them that all communication is inconvenient. (This letter is sometimes referred to as a [FOAD letter], for eff-off-and-die.) The reporter's only possible choices at that point are to abandon collection attempts entirely or sue you. If they sue you prior to sending validation, that was a very bad move, because that is a per-se FDCPA violation and means your debt will be voided. (That assumes you owe it in the first place. Lots of the people doing these mechanics actually did owe the debt at one point, but are betting that it can't be conveniently demonstrated that they owe the debt.)
If the reporter sends a letter: "Uh, we have you in a CSV file." you wait patiently until day 31 then say "You've failed to produce documentary evidence of this debt under the FDCPA. Accordingly, you're barred from attempting to collect on it. If you dispute that this is how the FDCPA works, meet me in any court of competent jurisdiction because I have the certified mail return receipt from the letter I sent you and every judge in the United States can count to 30." and then you file that with the CRA alleging "This debt on my credit report is invalid." The CRA will get in touch with the debt collection company, have their attempt timeout, and nuke the trade line. You now still technically speaking owe money but you owe it to someone who can't collect on the debt, (licitly [+]) sell it, or report it against your credit.
I just outlined the semi-abusive use of those two laws, but the perfectly legitimate use (for resolving situations like mine, where my credit report was alleging that I owed $X00,000 in debts dating to before I was born) is structurally similar. My dropbox still has 30 PDFs for letters I sent to the 3 CRAs, several banks, and a few debt collection companies disputing the information on my report and taking polite professional notice that there was an easy way out of this predicament for them but that if they weren't willing to play ball on that I was well aware of the mechanics of the hard way.
[+] Owing more to disorganization and incompetence than malice, many debt collection companies will in fact sell debts which they're no longer legally entitled to. This happened to me twice. I sent out two "intent to sue" letters and they fixed the problem within a week.
[Edit: I last did this in 2006 and my recollection on some of the steps I took was faulty, so I've corrected them above and made it a little more flow-charty.]
My offer (from Art Bass, then head of Flight Operations in part because he was, as the FAA required, a pilot) and offer letter said that (A) there would be a stock plan, (B) I would be part of the stock plan, (C) the plan would be based on salary, in which case I would be quite high up, (D) the Board was considering the stock plan now and results were expected in two weeks, (E) if the plan were not equitable then the first plane out of Memphis would be full of ex-Federal Express employees.

With that, I joined, kept teaching my courses in computer science at Georgetown until the courses were over, at home got a time sharing terminal, a CP67/CMS account, etc., and dug into writing the software to schedule the fleet.
Some Board Members, including one with a lot of experience at American Airlines, doubted there could be a schedule. So, the Board wanted to see a schedule, say, for the full, planned fleet of 33 planes serving the full, planned list of 90 US cities. Some crucial funding, some equity, some loans on the planes, were being held up waiting for the schedule. The company was at risk.

I wrote the software, finished my teaching, drove to Memphis, and rented a room.

So, with the Board having doubts and the company at risk, one evening Roger Frock and I used my software to develop a schedule for the 33 planes and 90 cities. We printed out the schedule, had copies made, and handed them around.
Board Member General Dynamics had sent two representatives, one an aeronautical engineer and one a finance guy, to provide, say, adult supervision; those two guys went over the schedule fairly carefully and announced "It's a little tight in a few places but it's flyable" (pretty good from just a little fast work from Roger and me); CEO Fred Smith's reaction at the next senior staff meeting was "Amazing document. Solves the most important problem facing the start of Federal Express.". The funding was enabled. FedEx was saved. Pretty good from typing in 6000 lines of PL/I in six weeks while also teaching two courses!

PL/I is a nice language -- good on data types, data manipulations, data conversions, data structures, scope of names, exceptional condition handling, storage management, debugging, etc. E.g., its Based structures can serve as a poor man's classes in object oriented programming.
Later the Board wanted to see some revenue projections. I didn't want to get involved, but no one had more than wishes, hopes, dreams, intentions, etc. So, I started with the common high school question: what do we know? Well, we knew the present revenue or, if you will, number of packages per day. From our initial long term planning, we knew what revenue we expected with 33 full airplanes serving 90 US cities. So, in at least a somewhat meaningful sense, the desired projections were an interpolation between those two facts we did know.

Then the question was, how will the interpolation go? Well, why will the revenue grow? Sure: The revenue will grow as it has been so far, customers-to-be hearing about FedEx from current happy FedEx customers. E.g., maybe a customer-to-be gets a package via FedEx. So, we're talking word of mouth advertising or viral growth.

So, the rate of growth in revenue per day or packages per day should be directly proportional to (A) the number of current customers and (B) the number of customers to be. That is, the rate of growth should be proportional to both (A) and (B), that is, to their product.
So, all downhill from there: At time t, let y(t) be the revenue, say, per day, at time t. Let t = 0 correspond to the present. So, we know y(0). Let b be the revenue per day with a full system, that is, 33 full airplanes and 90 US cities. That is, we know both y(0) and b.

So, from freshman calculus, the rate of growth is the first derivative of y(t), that is, d/dt y(t) = y'(t). So, from the proportionality, we have that there must exist some constant k so that

y'(t) = k y(t) ( b - y(t) )

So, we have an initial value problem (we know y(0)) for a first order, nonlinear (but separable) ordinary differential equation.
Okay, how to get a solution? Easy, just need freshman calculus, not even a course in differential equations. And, yes, there is a closed form solution, right, with some exponentials.

Right, the solution is the famous logistic curve, sometimes seen as doing well tracking, say, the growth of TV set ownership in the early years of TV. So, my derivation, as just above, can be seen as an axiomatic derivation (maybe rediscovery, maybe original) of the logistic curve. The solution may remain an okay, first-cut approach to understanding viral growth.
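For the curious, a minimal sketch of that closed form (my rendering, with made-up numbers; y0, b, k are the symbols defined above), plus a numerical check that it really satisfies y'(t) = k y(t)(b - y(t)):

```python
# Closed-form solution of y'(t) = k*y*(b - y) with y(0) = y0,
# i.e. the logistic curve described above. A central-difference
# check confirms the derivative really is k*y*(b - y).
import math

def logistic(t, y0, b, k):
    return b / (1.0 + ((b - y0) / y0) * math.exp(-b * k * t))

# Toy numbers (hypothetical, just for the check): start at 1,000
# packages/day, saturate at 100,000/day, pick some growth constant k.
y0, b, k = 1_000.0, 100_000.0, 3e-6

t, h = 2.0, 1e-5
numeric_slope = (logistic(t + h, y0, b, k) - logistic(t - h, y0, b, k)) / (2 * h)
y = logistic(t, y0, b, k)
assert abs(numeric_slope - k * y * (b - y)) < 1e-3 * abs(numeric_slope)
```

The fitting step described next (picking k by eye from candidate graphs) amounts to choosing the one free parameter this curve has once y(0) and b are pinned down.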
So, I showed my work to Senior Vice President Mike Basch, likely the one most responsible for getting the projections for the Board, and he liked my work. So, on a Friday afternoon we picked several candidate values for the constant k and drew the corresponding graphs of the revenue projections. We used my HP calculator, reverse Polish notation, stack machine, etc. -- HP should run an ad! We picked a value of k that gave what seemed to be reasonable projections and declared the problem solved.

The HP? It's still in my center desk drawer. Checking, right, it's an HP-35. My wife and I paid $400 for it.
The next day, a Saturday, at about noon, I was in my office likely working on fleet scheduling and got a call from Roger Frock asking if I knew anything about the revenue projections. When I said I did, he asked if I could come over to the HQ and explain.

So, I got into my Camaro hot rod (396 big block, etc.), and drove over. Yes, I brought my HP-35.

As I arrived, at one of the old WWII wooden hangar buildings, people were standing around and not happy. Our two guys from General Dynamics were standing in the hall with their bags packed and not happy.

Roger led me to a table with the graph, picked a point in time, and asked me to calculate the value on the graph. So, with my HP-35, I punched the buttons, stopped, slowed down, cleared the HP-35, started again, slowly and carefully punched the buttons again, and got the value on the graph. I did that for several points on the graph, and then everyone started to get happy.
It turned out that the Board meeting had been that morning; Mike Basch was traveling; I'd not been invited to the Board meeting; the graph had been presented; the two guys from General Dynamics (GD) had asked how the graph had been calculated; and everyone else at the meeting dug in trying to answer. They worked for hours with no results. Finally the two guys from GD lost patience with FedEx, returned to their rented rooms, packed their bags, got plane reservations back to Texas, and as a last chance returned, with their packed bags, to the FedEx HQ to see if anyone could explain the projections.

Somehow Roger Frock had guessed that I'd done the projections, called me, and got me there just in time.

It was close, but I'd saved FedEx a second time.
Right: Some people in FedEx would rather have seen FedEx go under than invite me to the Board meeting. We're talking some severe cases of jealousy, bureaucratic infighting, attacking the guy down the hall instead of the competition outside of the building, goal subordination as in organizational behavior, etc., right? Bummer.

Right: Apparently I was the only person at FedEx who still understood freshman calculus. Gads. And I never even took freshman calculus -- I taught it to myself from a book and started with sophomore calculus.

I never got any thanks for saving the company the second time.
I'd been at FedEx for over a year. I had been commuting every few weeks home to Maryland for a few days at a time -- not good. There had been no more about the stock that had been supposed to come in "two weeks". The company had some problems, e.g., had nearly gone out of business due to not inviting me to the Board meeting. Also the basic planning was way off -- the planning said that we could fly the planes around half full and nearly print money, but we were flying the planes packed solid, had doubled the rates, and still were losing money -- bummer.

So that I could be a good breadwinner in my marriage and for our kids if we could have some, as we hoped, I wanted something valuable no one could take away from me: a Ph.D. for my career and/or stock.

So, I'd gotten accepted for an appropriate Ph.D. at Brown (Division of Applied Mathematics), Cornell, Princeton, and Johns Hopkins.
The oil crisis hit. Saving money, especially on jet fuel, became a biggie. So, I was working on that. I was getting a lot of flak from others, especially my manager.

Finally I called a meeting to explain what I was working on, three projects. My manager said I couldn't do that because he was busy and couldn't come. I told him, fine, then don't come.

He came. So did Fred, Roger Frock, Art Bass, the top 15 or so people in FedEx. My manager was sitting next to Fred and kept objecting to what I was saying until Fred told him to cool it.
One of my problems was to use deterministic optimal control theory to say how to climb, cruise, and descend the planes.

A second problem was to use 0-1 integer linear programming set covering to develop schedules that would save on OpEx and maybe also CapEx.
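To make the set-covering formulation concrete, here is a toy sketch (my own illustration with made-up routes and costs, solved by brute force; the real fleet problem was vastly larger and needs a proper integer-programming solver). Each candidate route covers some cities at some operating cost, and the 0-1 decision is which routes to fly so that every city is covered at minimum cost:

```python
# Toy 0-1 set covering: pick the cheapest set of candidate routes
# that together serve every city. Brute force over subsets, which is
# fine at this scale and shows the structure of the problem.
from itertools import combinations

cities = {"MEM", "BNA", "STL", "MSY", "ATL"}   # hypothetical city list
routes = {                                     # route -> (cities served, cost)
    "r1": ({"MEM", "BNA"}, 4.0),
    "r2": ({"MEM", "STL", "MSY"}, 7.0),
    "r3": ({"BNA", "ATL"}, 5.0),
    "r4": ({"STL", "MSY", "ATL"}, 9.0),
}

def min_cover(cities, routes):
    best = None
    names = list(routes)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            covered = set().union(*(routes[n][0] for n in combo))
            if covered >= cities:  # every city served?
                cost = sum(routes[n][1] for n in combo)
                if best is None or cost < best[1]:
                    best = (set(combo), cost)
    return best

# min_cover(cities, routes) → ({'r2', 'r3'}, 12.0)
```

Swapping the brute-force loop for a branch-and-bound or ILP solver is what makes the same formulation scale to a real fleet.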
A third problem was how to buy fuel during a trip from Memphis and back. So, broadly the idea was to buy extra fuel where it was cheap and carry it to the next stop or two where fuel was more expensive. We were getting fuel for $0.16 a gallon in Memphis but paying up to $0.55 a gallon (in Nashville). So, that's a case of what has long been known as fuel tankering. But doing that interacts with how to climb, cruise, and descend the airplane, not being late in the schedule, loads, weather, air traffic control, etc. And typically a lot of the cheap fuel gets burned off just from trying to carry it, and how much gets burned off has a lot to do with the flight plan. And any such decision to buy extra fuel is a bet on the future of the trip back to Memphis, that is, a bet against the random package loads, weather, air traffic, etc.
So, how the heck to solve that? And, for various reasons, couldn't get a solution from carrying a computer on the plane and, really, not even from using a computer on the ground after landing. I'd found a way!
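The deterministic core of the tankering trade-off can be sketched as a toy dynamic program (my own illustration with made-up prices and burn rates, not the method actually used; the real decision also hinges on flight profile, loads, weather, and ATC). The key twist from the text is that carrying extra fuel itself burns fuel:

```python
# Toy fuel-tankering DP: at each stop, decide how much fuel to buy,
# knowing that hauling f units across a leg costs burn + ceil(alpha*f).
# Minimize total dollars spent to complete the trip, starting empty.
import math
from functools import lru_cache

prices   = [0.16, 0.55, 0.40, 0.16]   # $/unit at each stop (hypothetical)
leg_burn = [4, 3, 5]                  # base units burned on each leg
CAP, ALPHA = 12, 0.05                 # tank capacity, carry penalty

@lru_cache(maxsize=None)
def cheapest(stop: int, fuel: int) -> float:
    """Min $ to finish the trip, arriving at `stop` with `fuel` units."""
    if stop == len(leg_burn):         # trip complete
        return 0.0
    best = math.inf
    for buy in range(CAP - fuel + 1):
        onboard = fuel + buy
        used = leg_burn[stop] + math.ceil(ALPHA * onboard)
        if onboard >= used:           # enough to make the leg
            cost = buy * prices[stop] + cheapest(stop + 1, onboard - used)
            best = min(best, cost)
    return best

# round(cheapest(0, 0), 2) → 3.12: tanker a full load of cheap fuel at
# stop 0, buy nothing at the expensive stop, top up lightly at stop 2.
```

The stochastic version the text describes (betting against random loads and weather) replaces the fixed `leg_burn` with distributions, which is where it stops being a toy.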
So, Fred put me under Senior VP of Planning Mike Basch and, thus, made me Director of Operations Research.

But the fall came, and I had to decide actually to leave for graduate school or not. With no stock, not a lot of thanks, with a lot of scars from being attacked, still away from my wife, the company still at risk, I decided to go to graduate school. I liked FedEx, the challenges, the work, etc., but making the stockholders rich, with me not one of the stockholders, while wrecking my marriage and passing up the chance for a Ph.D. that might help my career and that no one could take away from me looked not good. If I couldn't get stock while the company was still at risk and I was working to make it valuable, then what hope would I have of getting stock once I'd already helped make the company valuable?
I went home to Maryland. At the last moment, Fred wanted me back in Memphis. He and I met with Mike Basch, and Fred said, "You know, if you stay, then you are in line for $500,000 in Federal Express stock?". Heck no; I didn't "know" any such thing; I had had and accepted such promises before, "two weeks", and after 18 months, saving the company twice, and with three projects to do much more for the company, all there was were more such promises, not on paper that a lawyer could do something with, no thanks -- "Fool me once, shame on you. Fool me twice, shame on me.".

Sure, that $500,000 would be ballpark $50 million to $500 million today. And apparently some people did get some stock. But there that last day, Fred still was just not putting it down on paper.

Since then I ran all this past a lawyer who concluded, "Legally FedEx owes you nothing. Morally they owe you everything.".
So, here on HN, I definitely should tell this story as I have, so that others can benefit and more promises of stock can become ownership of stock.

Of course, there is a lot more to getting wealthy from stock in a startup than what I've outlined here.

Broad lesson for people in startups with promises of stock: become very well informed and be very careful.

My reaction: Do my own startup. Doing it. Need to get back to it. It'd be fun to make more money than Fred! I have a shot! Back to it!