Jul 31, 2020

The Semi-Supervised Revolution

If you’re anything like us, you may find semi-supervised learning a difficult concept to grasp. It’s a fairly new term in a fairly new field, and many people are still getting used to machine learning and data organization as concepts.

Semi-supervised learning is a relatively new avenue in machine learning research. It focuses on making inferences from large amounts of unlabeled data combined with a small amount of labeled data. There isn’t yet one clear approach to semi-supervised learning, as new research on the topic comes out every day, but we do have a general idea of how it works and how to implement it at a practical scale.

Here we will go over the differences between supervised and semi-supervised learning for machines, and how you can best implement semi-supervised learning to better your data management and business practices.

Semi-supervised learning can be hard to grasp and implement, but not to fear: research on this tool is still coming out, and one day we’ll all be able to harness its power to organize and use data.

Difficulties using Semi-Supervised Learning

We all go down the rabbit hole when first discovering the concept of semi-supervised learning. Make sure you learn how to use it at its best, not its worst.

Machine learning engineers know too well what it’s like to have loads of data but not enough staff, time, money, or other resources to organize and handle it properly. You could always hire more staff to do the manual labor of hand-organizing large amounts of data, but that’s ultimately a waste of time for you and your staff. They want to work on more important things and so do you. The key, then, is to try implementing semi-supervised learning to automate your data organization.

When you have limited supervised data and piles of unlabeled data, you may try to Google what on earth you should do. MIT has some thoughts, sure, but the field is still so new that you’re likely to make mistakes without, well, the help of an MIT scientist.

Most engineers will then try to do it themselves, only to circle back to favoring regular manual data labeling. That’s because, unfortunately, performance seems to improve more with supervised data labeling than with semi-supervised. But because supervised data takes so much time to label and organize, at some point you’re likely to want to try semi-supervised learning.

If you’re working in a low-data regime, semi-supervised learning can and does improve performance, but on a practical level the resulting model is often still too weak to use. To put it bluntly:

“Essentially, when you are in a data regime where semi-supervised learning actually helps, it means you’re also in a regime where your classifier is just plain bad and of no practical use.”

Likewise, there’s usually a literal cost to semi-supervised learning. It costs actual money, whereas you can label your own data for free (with a lot of time and coffee). Furthermore, semi-supervised learning usually will not give you the same asymptotic properties that supervised learning does with large amounts of data. For example, unlabeled data can introduce bias, among other things.

Semi-supervised learning has changed a lot since it began, and we still don’t have a firm grasp on how to use this approach. An expert puts it this way:

“Even vastly improved modern generative methods haven’t improved that picture much, probably because what makes a good generative model isn’t necessarily what makes a good classifier. As a result, when you see engineers fine-tuning models today, it’s generally starting from representations that were learned on supervised data. Wherever practical, transfer learning from other pre-trained models is a much stronger starting point, which semi-supervised approaches have difficulty outperforming.”

And that’s where we stand now with semi-supervised learning. It’s a great tool, but no one yet has a perfect grasp on how to harness its power. Not even the top data scientists in the world have figured out the trick yet. If you try to do it alone, be wary that you might become turned off to it as a tool. That’s only because you’re not using it right.

Next we’ll go over what usually happens when you try to do semi-supervised learning on your own. You might in fact have already tried this!

How NOT to Use Semi-Supervised Learning, and What’s Next

Here’s what usually happens when engineers try to use semi-supervised learning, and what you can do next if this happens to you

First, you start off where any data engineer starts off: with a massive pile of unlabeled data that’s only getting larger by the second. What should you do? Hiring more hands to do the manual labor of labeling and organizing data seems fruitless, and a waste of time and energy for everyone involved. After exhausting all other avenues, you turn to semi-supervised learning. It’s engineering, not just manually labeling data, so it must work, right?

When you first begin using semi-supervised learning, your performance numbers will go up. That’s almost a guarantee. The numbers are still not great, however. You may have to go back to hand-labeling some of your data.

Adding in more labeled data helps a tad. Maybe you should drop all of your semi-supervised machinery, because this seems to be working.

After dropping semi-supervised learning, everything is now easier and faster. Maybe you shouldn’t have used semi-supervised learning at all. Performance may even hold up, with your purely supervised model now scoring as high as your semi-supervised one did.

There is a narrow range of data sizes where semi-supervised learning works and improves your efficiency, but it’s incredibly difficult to hit this spot in your data cleaning. Getting to this point simply takes too much time for not enough payoff, and you definitely don’t want to have to put in this type of work time and time again with all of your varied data sets.

At this point, it may seem like semi-supervised learning has failed and is useless.

But there’s hope ahead! If we can get our benchmark trends to look the way we want, we’ll have found the magic cure, the “sweet spot” if you will.

The curves we’re after match the idea, and the truth, that more data should result in better performance. That makes logical sense, right? With more information, your system has more to learn from, so performance should increase. Likewise, the difference between your semi-supervised performance and your supervised performance should always be positive, even in data regimes where supervised learning usually comes out on top. This ‘magic zone’ is where we want to be, and where we can get if we use semi-supervised learning the right way.
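One way to make the ‘magic zone’ concrete is to check it against your own benchmark numbers. The sketch below is only an illustration, not a standard tool: the function names and the accuracy figures are invented for this post. It checks the two properties described above, that performance never drops as labeled data grows and that the semi-supervised model stays ahead of the supervised one at every data size.

```python
def is_non_decreasing(values):
    """Return True if each value is at least as large as the one before it."""
    return all(b >= a for a, b in zip(values, values[1:]))


def in_magic_zone(supervised, semi_supervised):
    """Check two benchmark curves measured at the same, increasing label budgets.

    Both arguments are lists of accuracies, one entry per label budget.
    """
    if len(supervised) != len(semi_supervised):
        raise ValueError("curves must be measured at the same label budgets")
    # Property 1: more labeled data should never hurt either curve.
    more_data_helps = is_non_decreasing(supervised) and is_non_decreasing(semi_supervised)
    # Property 2: semi-supervised minus supervised should always be positive.
    gaps = [ssl - sup for sup, ssl in zip(supervised, semi_supervised)]
    always_ahead = all(gap > 0 for gap in gaps)
    return more_data_helps and always_ahead


# Hypothetical accuracies at increasing label budgets (invented numbers).
supervised_acc = [0.52, 0.68, 0.80, 0.88]
good_ssl_acc = [0.70, 0.79, 0.86, 0.90]
bad_ssl_acc = [0.70, 0.75, 0.78, 0.85]  # falls behind once labels are plentiful

print(in_magic_zone(supervised_acc, good_ssl_acc))
print(in_magic_zone(supervised_acc, bad_ssl_acc))
```

The second curve is the failure pattern from the story earlier in this post: a boost when labels are scarce that disappears once you add more labeled data.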

How did we get here? We self-label our unlabeled data and design our losses with the biases of self-labeling in mind. For more information on how this works, check out “MixMatch: A Holistic Approach to Semi-Supervised Learning” and “Unsupervised Data Augmentation for Consistency Training.”
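To give a feel for what self-labeling means, here is a minimal sketch of the general idea. It is not MixMatch or Unsupervised Data Augmentation, both of which are considerably more involved. Everything in it is an assumption made for illustration: a toy nearest-centroid classifier, a confidence score based on distance, a `threshold` of 0.9, and a handful of invented 2D points. The model labels only the unlabeled points it is confident about, retrains on them, and leaves the rest alone, which is one simple guard against reinforcing its own mistakes.

```python
import math


def centroids(points, labels):
    """Compute the mean point of each class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}


def predict_proba(point, cents):
    """Softmax over negative distances to each class centroid."""
    scores = {c: math.exp(-math.dist(point, centre)) for c, centre in cents.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def self_train(labeled, labels, unlabeled, threshold=0.9, max_rounds=5):
    """Repeatedly pseudo-label confident points and refit the centroids."""
    points, targets = list(labeled), list(labels)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(points, targets)
        still_unlabeled = []
        for p in pool:
            proba = predict_proba(p, cents)
            best = max(proba, key=proba.get)
            if proba[best] >= threshold:
                # Confident enough: accept the model's own label.
                points.append(p)
                targets.append(best)
            else:
                # Too uncertain: keep it unlabeled to avoid baking in a guess.
                still_unlabeled.append(p)
        if len(still_unlabeled) == len(pool):
            break  # nothing new was labeled this round
        pool = still_unlabeled
    return points, targets, pool


# Two hand-labeled points and five unlabeled ones (invented data).
labeled = [(0.0, 0.0), (10.0, 10.0)]
labels = ["a", "b"]
unlabeled = [(1.0, 1.0), (0.5, -0.5), (9.0, 9.0), (10.5, 9.5), (5.0, 5.0)]

points, targets, leftover = self_train(labeled, labels, unlabeled)
print(len(points), "points now carry a label;", len(leftover), "left unlabeled")
```

The point halfway between the two clusters never clears the confidence threshold, so it stays unlabeled rather than being forced into a class.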

What’s Next for Semi-Supervised Learning?

Because semi-supervised learning is still so new, there’s much to be done in the field to enhance its uses. You may not have grasped how to implement it correctly into your data practices, but don’t worry, as one day semi-supervised learning will be better understood and thus easier to access for everyone.

Semi-supervised learning is also super important to privacy in machine learning. When the supervised data is private, as in the PATE approach, a student model with privacy guarantees is trained on data that’s unlabeled and presumed public. More and more approaches are coming out that enable distributed learning without access to private user data, which helps with privacy concerns in machine learning.
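As a rough picture of how a PATE-style setup uses unlabeled data: several “teacher” models are each trained on a separate slice of the private data, and a noisy version of their majority vote labels the public data that the student learns from. The sketch below shows only that noisy vote. The function names, the class names, the ten hypothetical teachers, and the `noise_scale` value are all illustrative assumptions, and no privacy accounting is done here.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    if scale == 0:
        return 0.0  # no noise requested
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_vote(teacher_predictions, classes, noise_scale, rng):
    """Return the class with the highest noise-perturbed vote count."""
    counts = Counter(teacher_predictions)
    noisy = {c: counts.get(c, 0) + laplace_noise(noise_scale, rng) for c in classes}
    return max(noisy, key=noisy.get)


rng = random.Random(0)
classes = ["cat", "dog"]
# Hypothetical votes from 10 teachers, each trained on its own private data slice.
votes = ["cat"] * 8 + ["dog"] * 2

# The student only ever sees this noisy label, never the private data or raw votes.
student_label = noisy_vote(votes, classes, noise_scale=1.0, rng=rng)
print(student_label)
```

When the teachers mostly agree, the noise rarely changes the outcome; when they are split, the noise helps hide what any single private record contributed.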

Keep on the lookout for semi-supervised learning information and news in the future. There’s still a way to go before it is implemented in all machine learning and data practices, but in the meantime there’s much to read up on.

Semi-supervised learning truly is a revolution. Right now we’re at the beginning of it, where we’re still learning how to use it to its full ability. We are in the trial period, testing out semi-supervised learning on different data sets and seeing what happens. Once data scientists and engineers figure it out, you will never have to hand-label data again. Imagine the future: it’s happening right now.

Approved by

Joey Rahimi