A friend forwarded me a link yesterday to something pretty amazing called BayesDB. Basically, it's a database architecture that detects predictive relationships between variables. Here's how the guys who developed it at MIT describe their creation:
BayesDB, a Bayesian database table, lets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.
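To make those four query types a little more concrete, here's a rough plain-Python sketch of what they do, using made-up analyst data. This is not BayesDB's actual BQL or API, just pandas/numpy approximations of the same ideas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "experience": rng.normal(5, 2, n),                  # years covering a stock (made up)
    "coverage": rng.integers(1, 20, n).astype(float),   # tickers followed (made up)
})
df["accuracy"] = 0.5 * df["experience"] + rng.normal(0, 1, n)  # hidden relationship

# 1. Detect predictive relationships: pairwise correlations as a crude stand-in
#    for BQL's dependence estimates.
print(df.corr())

# 2. Simulate probable observations: draw new rows from a fitted Gaussian model.
sim = rng.multivariate_normal(df.mean(), df.cov(), size=5)
print(pd.DataFrame(sim, columns=df.columns))

# 3. Identify statistically similar entries: nearest rows in standardized space.
z = (df - df.mean()) / df.std()
dist = ((z - z.iloc[0]) ** 2).sum(axis=1)
print("rows most similar to row 0:", list(dist.drop(index=0).nsmallest(3).index))

# 4. Infer missing values: fill a hole with the mean of the most similar rows.
df.loc[0, "accuracy"] = np.nan
neighbors = dist.drop(index=0).nsmallest(10).index
print("inferred accuracy for row 0:", df.loc[neighbors, "accuracy"].mean())
```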
This is another huge step in the process of allowing regular developers and entrepreneurs to use data science to improve their products and understand relationships in their data. It sits along a very pronounced curve of infrastructure becoming more pre-packaged and cheaper, allowing more people to succeed at building things, which is why you are currently seeing a technology BOOM.
At Estimize, we do a lot of quant work to understand the relationship between an analyst's attributes and the accuracy of their estimates: which attributes give us better confidence that an analyst will be more accurate in the future. That work is extremely important because it gives us confidence that, with an open community, we can identify the analysts who deserve to be weighted more highly than the rest, regardless of their biographical background (unless that is a correlated factor).
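As a hedged sketch of the basic idea (not Estimize's actual model), here's what weighting analysts by a track-record attribute looks like; the field names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical history of each analyst's past errors and current estimates.
history = pd.DataFrame({
    "analyst": ["a", "a", "b", "b", "c"],
    "error":   [0.10, 0.20, 0.05, 0.07, 0.50],   # |estimate - actual| on past quarters
})
estimates = pd.DataFrame({
    "analyst":  ["a", "b", "c"],
    "estimate": [1.02, 1.05, 0.90],              # current-quarter estimates
})

# Attribute -> weight: analysts with smaller past errors count for more.
avg_error = history.groupby("analyst")["error"].mean()
weights = 1.0 / avg_error
weights /= weights.sum()

consensus = (estimates.set_index("analyst")["estimate"] * weights).sum()
print(f"weighted consensus: {consensus:.3f}")
```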
But there are issues I want to bring up regarding BayesDB and its use.
When doing good data science, or quant finance, two things are extremely important to keep in mind. First, you need to start out with a hypothesis; you should not just throw all of your attributes into a database like BayesDB and let it spit out the correlated factors. Data science is called data science because it's supposed to be science, which means hypothesis, test, measure, conclusion.
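A minimal sketch of that loop, again with made-up analyst data: state one relationship up front and test only that, instead of letting a tool surface every correlation in the table.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
experience = rng.normal(5, 2, 300)                  # hypothesized driver (made up)
accuracy = 0.3 * experience + rng.normal(0, 1, 300)

# Hypothesis: analyst experience is positively related to estimate accuracy.
r, p = stats.pearsonr(experience, accuracy)
print(f"r = {r:.2f}, p = {p:.4f}")
# Conclude: accept the relationship only if the pre-stated test supports it.
```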
And second, you need to do in-sample and out-of-sample testing to make sure that you are not curve fitting or data snooping. Certain factors may be correlated over the course of the whole data set, but what if those correlations changed throughout the history of that time series? Do you have any confidence that they won't change going forward? You need to take a portion of the time series, put it through BayesDB, then take the other portion of the series, put it through BayesDB, and see if the correlations hold. It's always good to split this up two ways as well: take the first half of the time series and split it from the second half, and then also do a cross-section of data from the whole time series.
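A rough sketch of those two splits on a toy time series: a chronological split (first half vs. second half) and a cross-sectional split (random rows drawn from the whole period), checking whether the correlation holds in each piece.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 0.4 * df["x"] + rng.normal(size=n)
df.index.name = "t"   # pretend the index is time order

# Chronological in/out-of-sample split.
first, second = df.iloc[: n // 2], df.iloc[n // 2 :]
print("first half r:", first["x"].corr(first["y"]))
print("second half r:", second["x"].corr(second["y"]))

# Cross-sectional split: random rows from across the whole period.
in_sample = df.sample(frac=0.5, random_state=0)
out_sample = df.drop(in_sample.index)
print("cross-section in r:", in_sample["x"].corr(in_sample["y"]))
print("cross-section out r:", out_sample["x"].corr(out_sample["y"]))
# If the in-sample relationship disappears out of sample, it was probably noise.
```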
Without having a hypothesis regarding correlated factors and why they are correlated, as well as doing in-sample and out-of-sample testing to make sure you aren't data snooping, BayesDB is a dangerous tool that can lead to bad conclusions. I hope that at some point they are able to build in the ability to do in- and out-of-sample testing automatically, without having to load two different sets and compare.
But let's just marvel at how awesome this thing is to begin with. Hopefully it's another major step on the way to a smarter, more predictable world.
Full Disclosure: Nothing on this site should ever be considered to be advice, research or an invitation to buy or sell any securities, please see the Disclaimer page for a full disclaimer.
