Lessons learned: A retrospective look at a year working in AI


In the middle of last year, I was hired as an AI Scientist to work on problems in deep learning for computer vision. Before that, I had worked in computer vision for a few years, and before that I did my PhD in electrical engineering, with a focus on numerical methods for computational physics problems, but I hadn’t worked directly in the AI field before. Now that the first year is coming to a close, I wanted to collect my thoughts and recollections about working as an AI researcher and share them in this blog. I’ve split this blog into three general categories:

  1. Things I learned on the job
  2. Things I learned about the field
  3. Things I learned about myself

1. Things I learned on the job

  • How to write clean, efficient training code - Writing a good dataset/train/validate/test pipeline makes everything easier. The difference between my first model and my current models is night and day in terms of ease of use and modularity (there’s a rough sketch of the kind of training loop I mean after this list). Everybody has their own style when it comes to writing code, so one ends up sampling flavours from various other projects and adding dashes of one’s own style and preferences.
  • That pytorch is the best currently available deep learning framework. I started with keras, then tried both tensorflow 1.x and 2.x, and also dabbled in mxnet, before settling on using pytorch. Pytorch is the easiest to tinker with, and also has the biggest and most active community (and therefore the best ecosystem). How the pytorch developers stay so active on the forums (shoutout @ptrblck) and still have time to work, I’ll never know. Yes, tensorflow 2.0 has dynamic computational graphs now, but adoption amongst researchers who don’t work for Google is low. Ditto for mxnet and Amazon. Finally, a small but significant advantage: I found it much easier to get CUDA and GPU support working with pytorch compared to tensorflow. Let’s briefly talk specifics:
    • Tensorflow 2.x, with its GradientTape, now feels a lot like pytorch. However, while it’s a million times better than tensorflow 1.x, it’s new enough that (as of late 2019/early 2020), there just isn’t a big community around it, and it doesn’t yet enjoy widespread adoption. I also found the documentation to be lacking when I used it in late 2019, and getting CUDA to work properly was more finicky than with pytorch.
    • Keras, which is now a high-level API for tensorflow, is probably the easiest to use and quickest to get started with, but harder to tinker with. It also has a reasonably-sized community. I myself started with keras, years ago when it was still largely separate from tensorflow. I found it very easy to get the most common tasks up and running, but once I wanted to do more advanced things, like writing my own loss functions and optimizers, it was jarring to suddenly have to drop down to tensorflow primitives, and it was often painful to get keras to perform conceptually simple but non-standard operations.
    • I don’t have a lot of experience with mxnet/gluon, but from what I can tell, it also codes a lot like pytorch, but is missing the widespread adoption and large community that pytorch enjoys.
  • That, as one builds more models, one starts to develop an intuition about how best to train them: things like which learning rates to use, and when to change them. In the beginning, I used “standard” training techniques (e.g. SGD with a StepLR schedule), but as I worked, I found that different models often shared training characteristics, and that I could manually control the optimizer between epochs to do much better than using an LR scheduler (there’s a small sketch of this after the list). Also…
  • It doesn’t hurt to try things, and you learn by trying. A lot of the time, I would try to incorporate novel training techniques and architectural features into my models, even if they didn’t really end up improving anything. These usually came from papers, but sometimes also from people’s blogs and from forum discussions. Doing this gave me the opportunity to read, learn and implement different ideas, and it also gave me a toolbox of tricks to use when I ran into problems in the future.
  • Benchmarking using established models is key, and saves time. For most real-world applications, you can get 90-99% of the way to the best accuracy possible with an established general-purpose model (like resnet50, UNet, FRCNN, etc), with pretrained weights. In my opinion, getting that last 1-10% by using a specially designed architecture, with all the extra training time, is often not worth it (at least from an engineering perspective… things are different if you’re specifically researching novel architectures). This leads to my next point…
  • The single most important factor in getting good accuracy is having pretrained weights from a large, general dataset. Nothing else matters nearly as much. A carefully designed, state-of-the-art model architecture, with all the training tricks thrown in, trained on a specific dataset from scratch, will still only be competitive with an imagenet-pretrained resnet50 finetuned on that dataset (there’s a sketch of this kind of finetuning after the list). The problem with all the neat application-specific papers that report 0.8% better accuracy than the state of the art is that, without pretrained weights, you either need to spend a ton of resources pretraining the model yourself, or you’ll have a heck of a time training it on your smaller dataset without heavy overfitting.
  • Real-world datasets suffer from data imbalance and mislabeling. As a consequence, much of one’s time is better spent trying to clean up data, or using various training strategies to offset data imbalance and mislabeling (focal loss, label smoothing, etc; a couple of these are sketched after this list). This leads me to my next point…
  • About 80% of my programming and design time was spent working on data manipulation, i.e. writing preprocessing and postprocessing pipelines, and only 20% was spent on the comparatively more glamorous modeling and training. It might be obvious from the title “data scientist”, but I feel like people new to the field might underestimate how much of the job is spent cleaning and massaging data.
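
A few of the points above refer to sketches, so here they are, in order. First, the clean training code: below is a rough outline of the kind of modular train/validate loop I mean. It isn’t my actual code; the model, data loaders, loss and device are placeholders you would supply yourself.

```python
# A minimal sketch of a modular train/validate loop (illustrative only).
import torch

def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
    return running_loss / len(loader.dataset)

@torch.no_grad()
def validate(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        running_loss += criterion(model(images), targets).item() * images.size(0)
    return running_loss / len(loader.dataset)
```

Keeping the epoch logic in small functions like these is most of what I mean by “modular”: you can swap models, datasets and losses without rewriting the loop.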
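
Second, manual control of the optimizer between epochs. A StepLR schedule drops the learning rate on a fixed timetable; the sketch below shows the by-hand alternative, where the drop happens when you decide it should. The model, epoch counts and learning rates are purely illustrative.

```python
# Illustrative: dropping the learning rate by hand between epochs instead of
# relying on a fixed StepLR schedule.
import torch

model = torch.nn.Linear(10, 2)   # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(30):
    # train_one_epoch(...) and validate(...) would be called here
    if epoch == 10:              # e.g. once the validation loss plateaus
        for group in optimizer.param_groups:
            group["lr"] = 0.001
```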
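
Third, benchmarking with pretrained weights. Starting from an imagenet-pretrained resnet50 and finetuning it looks roughly like this; num_classes is a placeholder for whatever your dataset needs.

```python
# Sketch: finetune an ImageNet-pretrained resnet50 on a smaller dataset.
import torch.nn as nn
from torchvision import models

num_classes = 5                                    # placeholder
model = models.resnet50(pretrained=True)           # ImageNet weights
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optionally freeze the backbone at first and train only the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")
```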
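
Finally, two of the strategies I mentioned for imbalanced or mislabeled data: a simple focal loss (which down-weights easy examples) and a weighted sampler (which shows rare classes more often). Both are rough sketches with toy numbers, not drop-in solutions.

```python
# Illustrative sketches for handling class imbalance.
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

def focal_loss(logits, targets, gamma=2.0):
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                        # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()   # down-weight easy examples

# Oversample rare classes: weight each sample by 1 / count(its class).
labels = torch.tensor([0, 0, 0, 0, 1, 2])      # toy labels
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))
# then: DataLoader(dataset, batch_size=..., sampler=sampler)
```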

2. Things I learned about the field

  • The “revolution” in convolutional neural network architectures for computer vision happened in 2015/2016, with the development of ResNet, and in a sense we’re all still eating Kaiming He’s lunch (a similar revolution in NLP happened in 2018 with transformers). It’s hard to overstate how important ResNets are to modern CNNs. The key ideas behind ResNet, i.e. residual connections and building deep networks by stacking modular blocks, are pretty much ubiquitous as of the writing of this blog (there’s a minimal residual block sketched after this list). ResNet, or a variation of ResNet, is the backbone of many cutting-edge networks, and almost all modern CNNs can claim ResNet as an ancestor.
  • There are a lot more bad papers than good ones, and even among the “good” papers, there are a lot which are hard to read. As someone who came from a more old-school and established academic field, deep learning feels like a wild west frontier, where there’s a ton of new work, but with very mixed quality. I would say that, for any given deep learning application (like face recognition or something), there are only a few papers a year which are really worth paying attention to.
  • Implementation details are often lacking in papers. In my academic “upbringing”, I was taught to write papers so that my implementation was glaringly obvious, so that anyone who wanted to could easily follow what I did to reproduce my results, and more generally, so that it was difficult to misinterpret what I was trying to communicate. I found that it was common among AI papers for implementation details to be incomplete or confusing. Perhaps I, as a relative outsider, am missing some common knowledge that puts the puzzle pieces together, but I find this unlikely: at the time of writing this blog, I have worked on a comparatively large number of novel models. Perhaps AI researchers simply don’t find it as annoying as I do to finish reading a paper and be left with questions.
  • The future is in unsupervised and semi-supervised learning. The amount of data being generated in the world greatly outpaces the amount of data being labeled, and the more data you have, the better your model gets. From what I’ve seen this year, much of the interesting work in the AI field is and will be focused on leveraging unlabeled data to improve models. People are already doing this with various pretext tasks, which give models a better set of starting weights for training on labeled data (one such task is sketched after this list), and companies like Google and Facebook have already been using unlabeled data in their products. It’s also worth noting that human and animal brains learn like this, with a lot of unlabeled data in the form of senses, and only a small amount of information which is labeled, or “taught”.
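
To make the residual connection idea concrete, here is a minimal residual block of the kind ResNet stacks: two convolutions learn a residual F(x), and the block outputs F(x) + x. This is a bare-bones sketch, not the exact block from the paper (it skips the downsampling and projection variants).

```python
# A bare-bones residual block: the skip connection adds the input back in.
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # residual (skip) connection
```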
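
And as one example of a pretext task, here is a sketch of rotation prediction: rotate each unlabeled image by 0, 90, 180 or 270 degrees and train the network to guess the rotation. The labels come for free, and training on them gives the network a better set of starting weights than random initialization. The function name and batch format here are just illustrative.

```python
# Sketch of a rotation-prediction pretext task: no human labels needed.
import random
import torch

def rotation_pretext_batch(images):
    """images: (N, C, H, W) tensor of unlabeled images."""
    ks = [random.randint(0, 3) for _ in range(images.size(0))]
    rotated = torch.stack([torch.rot90(img, k, dims=(1, 2))
                           for img, k in zip(images, ks)])
    targets = torch.tensor(ks)          # pseudo-labels: 0, 1, 2, 3
    return rotated, targets             # train a classifier on these
```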

3. Things I learned about myself

  • That Python makes coding much more enjoyable for me. As someone who used C++ and C# a lot previously, I don’t think I could ever go back after using Python as my primary language for a few years. Not that I have anything against C++ and C# – those languages have their purpose – but they weren’t designed with numerical and scientific programming in mind. Have you ever tried solving a linear algebra problem in C#? Even with the best available math libraries, it was often like pulling teeth (there’s a small Python comparison after this list). There are lots of nice things about Python, like how lists and dictionaries just MAKE SENSE, and work without a lot of fuss. This, combined with the large community and number of available libraries, makes it much better for prototyping than anything else out there.
  • I still learn best through doing. This is likely the engineering part of me showing, but I usually understand something much faster by implementing it, or a simpler version of it, than by reading theory. Theory is obviously important, but you’ll definitely realize something is wrong with your pen-and-paper work when you write the code and find that you have 99 equations but 101 unknowns. This was often true when I was working on deep learning models. I could much more easily understand how a model worked by starting with the smallest code details and working my way up, rather than starting with the general idea and working my way down.
  • I’m able to work completely remotely, though I don’t always enjoy it. The middle of my year working this job happened to coincide with the COVID-19 pandemic, which forced a lot of companies to allow people to work remotely. Of course, I was already working fully from home, so this only affected me in that my wife also started working from home full time. Before the pandemic hit, I found that, while working from home had its advantages (your commute is walking from your bed to your desk), I started missing the separation between my home and work life, and also missed having some human contact. I concentrate better when I’m in an environment that I associate with productivity. Our company leased a co-working office in downtown Toronto that we could use, and I found that working there two to three times a week was the ideal compromise – I could still be lazy and stay in, or I could commute to the office and have a change of scenery when I needed it.
  • I finally “got” open source. Nowadays, whenever I write code that someone else might find useful and that isn’t proprietary, I put it on github. However, a few years ago, I mostly worked on C# software for specific business applications, and I subsisted on closed-source libraries provided by Microsoft. It was my current job that really opened me up to the world of open source, especially with Python. There are so many useful open-source projects whose maintainers actively engage with other developers that I wanted to be able to contribute to other people’s code and make my own projects available too.
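
For comparison, here is what solving a small linear system looks like in Python with numpy. The numbers are toy values, but this is the kind of two-line convenience I’m talking about.

```python
# Solving Ax = b with numpy.
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)   # -> array([2., 3.])
```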