Software 2.0 2.0
Realizing Performance in the Era of Deep Learning by Scaling with Data
This post is a rebuttal to Andrej Karpathy’s Software 2.0. It is brilliant and you should definitely read it, but I thought the critical takeaway was lost among the many great points, and I wanted to make it more explicit.
Traditional methods for solving a software problem involve people coming up with the required algorithmic steps to tackle it.
For example, consider the problem of estimating the depth of an object from an image. A traditional algorithm would contain the following steps:
- Gather two frames with enough baseline such that there is change in the scene, but not too much.
- Find features such as SIFT on one frame, assuming the same features also exist in the other frame.
- Find correspondences between these features, filter the correspondences, and estimate a 2D flow.
- Derive relative depth from the flow.
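To make the pipeline concrete, here is a minimal sketch of those four steps using OpenCV. The function name, the ratio-test threshold, and the small-baseline assumption (depth roughly inverse to flow magnitude) are my own illustrative choices, not something the original pipeline prescribes.

```python
# A minimal sketch of the traditional pipeline above (illustrative only).
import cv2
import numpy as np

def relative_depth_from_two_frames(frame_a_path, frame_b_path):
    # Step 1: two frames with a modest baseline
    img_a = cv2.imread(frame_a_path, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(frame_b_path, cv2.IMREAD_GRAYSCALE)

    # Step 2: handcrafted features (SIFT) on both frames
    sift = cv2.SIFT_create()
    kp_a, desc_a = sift.detectAndCompute(img_a, None)
    kp_b, desc_b = sift.detectAndCompute(img_b, None)

    # Step 3: correspondences, filtered with Lowe's ratio test, then sparse 2D flow
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc_a, desc_b, k=2)
    good = [p[0] for p in matches if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    flow = pts_b - pts_a                       # 2D flow at matched keypoints

    # Step 4: relative depth from flow -- under a small, roughly translational
    # baseline, depth is approximately inversely proportional to flow magnitude.
    flow_mag = np.linalg.norm(flow, axis=1) + 1e-6
    rel_depth = 1.0 / flow_mag                 # relative, not metric, depth
    return pts_a, rel_depth
```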
Scaling with data
Such an approach works well enough in general and can be deployed in all kinds of scenarios that satisfy its approximations/assumptions, but its performance does not depend on the amount of available data.
Performance of a traditional algorithm does not depend on the amount of available data
How can we do better?
We can try to improve each of the approach’s steps and gain overall performance, and decades of research in Computer Vision does exactly that. But by changing our fundamental thinking towards learning-based approaches, unprecedented improvements were realized.
Era of learning
Learning-based approaches replaced the handcrafted features (SIFT in the above example) by deriving these features from the data. Neural networks replaced the rest of the pipeline.
Check out my blog post on self-supervised learning for foundation models
Neural networks are function approximators
The traditional approach is to come up with a model/algorithm that converts an input to an output:
$\text{input} \rightarrow \text{model} \rightarrow \text{output}$
Deep learning replaces this with two steps:
- Training: Given many examples of input and output data, the model is learnt.
- Deployment: The learnt model is used on new inputs to infer their outputs.
This allows the performance of the model to be dependent on the amount of training data available.
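As a concrete illustration of the two steps, here is a minimal sketch in PyTorch (my choice of framework, not the post’s); the toy dataset and network are placeholders.

```python
# A minimal training/deployment sketch with placeholder data (illustrative only).
import torch
import torch.nn as nn

# Placeholder (input, output) example pairs
x_train = torch.randn(1024, 16)
y_train = torch.randn(1024, 1)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Step 1 -- Training: the model is learnt from the examples
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

# Step 2 -- Deployment: the learnt model infers outputs for new inputs
model.eval()
with torch.no_grad():
    new_input = torch.randn(1, 16)
    prediction = model(new_input)
```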
As the training data increases, the performance of a DL model increases, while traditional methods do not depend on the training data
Production time
Traditional algorithms almost always have to be completely re-written from their prototyping phase for production, while keeping computational power and memory in check. At test time, deep learning replaces all of this with just one model inference call. So we go from re-implementing production-level code to passing the input through a bunch of linear algebra. Neural networks cost the same amount of memory and offer even better computational efficiency at inference time.
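To illustrate what that single inference call looks like in practice, here is a minimal sketch; the TorchScript export, the file name, and the stand-in model are my own illustrative choices, not something prescribed by the post.

```python
# A minimal deployment sketch: one serialized model, one inference call (illustrative only).
import torch
import torch.nn as nn

# Stand-in for the trained model from the previous sketch
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

# Prototype and production share the same serialized artifact...
scripted = torch.jit.script(model)
scripted.save("depth_model.pt")

# ...and production code reduces to a single inference call on it.
production_model = torch.jit.load("depth_model.pt")
production_model.eval()
with torch.no_grad():
    output = production_model(torch.randn(1, 16))
```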
Algorithm 2.0ic thinking
- First, to gain performance, it is not enough to copy-paste one training example into multiple copies. The model needs to be intelligently scaled by feeding in huge amounts of new data with much variation, representing the true population as faithfully as possible.
- Secondly, new data doesn’t come for free. Labeling data is labor-intensive and expensive, so smart gathering is required. One cannot just copy-paste existing data and assume the data has increased; it has to increase meaningfully.
- Thirdly, in order to be able to feed in that much data, there needs to be infrastructure in place.
- Finally, as in the case of traditional algorithms, one cannot just hand-engineer the algorithm and use it in deployment forever. One needs to continuously improve/train the learnt model and continuously deploy it. This is also where data can be selected smartly for the next iteration of improvement (see the sketch after this list).
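Here is a minimal sketch of one such improvement iteration with smart data selection, assuming an ensemble-disagreement heuristic and a stand-in labeling function (`label_fn`); all names, sizes, and the selection strategy are illustrative choices of mine, not from the original post.

```python
# A minimal sketch of one continuous-improvement loop with smart data selection
# (illustrative only; label_fn is a stand-in for human annotation).
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def train(model, x, y, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return model

# Current labeled set and a large pool of unlabeled production data
x_labeled, y_labeled = torch.randn(512, 16), torch.randn(512, 1)
x_pool = torch.randn(10_000, 16)
label_fn = lambda x: torch.randn(len(x), 1)   # stand-in for human labeling

# 1. Train a small ensemble on the data we already have
ensemble = [train(make_model(), x_labeled, y_labeled) for _ in range(3)]

# 2. Select the pool samples the ensemble disagrees on most
with torch.no_grad():
    preds = torch.stack([m(x_pool) for m in ensemble])   # (3, N, 1)
    disagreement = preds.std(dim=0).squeeze(-1)           # (N,)
    selected = disagreement.topk(256).indices

# 3. Label only the selected samples, grow the dataset, retrain, redeploy
x_labeled = torch.cat([x_labeled, x_pool[selected]])
y_labeled = torch.cat([y_labeled, label_fn(x_pool[selected])])
deployed = train(make_model(), x_labeled, y_labeled)
```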
The holy grail? or just a fad?
For many applications, the performance gain is really not that critical. Other factors such as explainability, fail-safety, etc. are equally important, and these are things current deep learning approaches lack. The availability of data is another critical factor when deciding whether to choose a learning-based approach.
If data is limited (green part), traditional algorithms provide better performance
Continuous improvement
But, if the performance gains are to be realized, scaling with data is the way to go.
Amount of data used to train notable AI systems. Source: Our World in Data
And the way to execute this smartly is by continuous improvement.
Continuous Improvement loop