EE Times – How Do You Protect Your Machine Learning Investment? (Part II)
Protection of the training set
Creating a good training set for a specific machine learning (ML) application can be a time-consuming and expensive effort. In a typical setting an infringer has no direct access to the training set, but anyone who does gain access can copy it with ease. This is where IP law comes in.
If the owner of the training set has its principal place of business in the European Union, the training set would be protected by a database right. However, such a right would only be enforceable against infringers in the same jurisdiction.
Whether copyright can be claimed on an ML training set is a more difficult question. A training set is not created to be a piece of art; the typical intention is to ensure the data fits the use case. Creating a well-fitting set of data on a topic is not a creative activity under copyright law. One potential copyright claim concerns the data classification descriptors. If categories are chosen through a creative process — “beautiful/ugly,” “strong/weak,” or “big/small” — then the training set could be said to be protected by copyright through this creative labeling. A classification based on factual elements — “cat/dog” or “traffic light/streetlight/parking sign” — does not involve creativity and therefore does not allow for copyright protection.
In some applications, training sets are generated by simulation or other artificial means. Arguably, these training sets could be copyright protected as the choice of how to simulate or generate could be seen as a creative one. However, to date, this has not been challenged in court.
Companies will often consider their training sets to be carefully guarded secrets. Since the training set does not have to be shared for the ML model to be used, keeping it secret is straightforward. The best approach is to guard the training set against illicit copying and, in addition, to apply strong contractual restrictions on any parties that must have access to it.
Protection of the training parameters
The training set and model are only a part of the value of a good ML system. The parameters that steer the training algorithm may also have value: choosing the right training parameters takes time and effort from highly trained engineers.
For the set of training parameters that create the ML system, copyright protection is a sensible approach. If a data scientist determining these parameters uses creative efforts to select the right training parameters, the resulting set of parameters would likely be protected by copyright. But if the training parameters were found through an exhaustive search (e.g., evaluating a number of options proposed in the literature) or an algorithmic process, no copyright would be available. The same would apply to the model that is produced using those training parameters and a given training set.
A database right is the least likely form of protection for the parameter set, because one criterion is that the protected subject matter must be a collection of individual elements that are systematically or methodically arranged. A parameter set is unlikely to fit that criterion.
Protection of the architecture
The architecture of the system is the underlying foundation for the ML system. Its design is a key aspect of the proper functioning of the system. After training, the architecture can be put into practice.
A system like this has two aspects: the graph defining the architecture and the software implementing it. The graph is protected under the same conditions as given for the protection of the model parameters. Patents would theoretically be available for innovative hardware aspects of it, but this scenario is unlikely given most innovation in this area is purely software. The software that implements training and/or inference would typically be protected with copyright, as it is principally software designed using creative efforts.
Protection of the ML system
In theory, a computer system programmed with a well-chosen parameter set and trained on a specific training set could fall within the realm of patentable subject matter. However, current case law in Europe and the United States would require the system to be designed to perform real-world tasks such as steering a car or recognizing images from the real world. To date, it would be speculative to conclude a patent is obtainable on an ML system that operates in a more abstract manner, e.g., recognition and/or classification without a specific use case in the real world.
The software of the ML system could be protected by copyright just like any other software.
A database right for the ML system is theoretically arguable: in a way, the dataset is made searchable through the model and the software executing that model. However, this has never been decided in court or outlined in legal literature.
Burden of proof
Spotting an infringer and proving infringement in court are two very different things. The burden of proof in IP cases can be hard to meet. As a general rule, the courts need to be convinced that it is very likely something was infringed. The alleged infringer has no obligation to cooperate in delivering this proof. Therefore, if some evidence under their control is needed, the IP rights holder may have a problem. Some jurisdictions allow for seizure of evidence or require parties to engage in a discovery process, but it is far from certain that doing so provides the rights holder the evidence needed. Under copyright law, if two items are very similar, a court may reverse the burden of proof: the alleged infringer must then show their work was independently created.
This is a very fact-specific analysis upon which a rights holder should not rely. Under trade secret law, a rights holder sometimes has the option to request the court to keep evidence secret or to get an independent party (such as a notary public) to compare evidence against the secret information without having the secret become part of public court records.
Protection of the model against copying
When an ML system is made available to the public without contractual or usage restrictions, a distinctive way to copy its functionality becomes available, often called model extraction. Essentially, the copyist takes a dataset of unclassified items and submits each item to the ML system. Each answer is carefully recorded as the classification of the copyist’s dataset. The resulting labeled dataset can then be used to train a model of similar quality. This has been shown to work effectively, even if the dataset contains non-problem-domain data and if the architecture and model parameters of the target and the clone do not match. Under copyright or database law, it is unclear whether this act is legal. The dataset from the original ML system is not copied — only its output is used, and then only to label a different dataset.
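The query-and-relabel loop described above can be sketched in a few lines. This is a deliberately toy illustration: the “target” is a hypothetical secret threshold rule standing in for a real black-box ML service, and the “training” of the clone is just estimating that boundary from harvested labels. Real extraction attacks work the same way in principle, but against neural networks.

```python
import random

# Hypothetical "target" model: a secret decision rule the copyist
# cannot inspect, only query (a stand-in for a deployed ML service).
SECRET_THRESHOLD = 0.37

def target_model(x: float) -> int:
    """Black-box classifier: returns a label for any submitted input."""
    return 1 if x > SECRET_THRESHOLD else 0

# Step 1: the copyist assembles their own unlabeled dataset.
random.seed(0)
unlabeled = [random.random() for _ in range(1000)]

# Step 2: query the target and record each answer as a label.
labeled = [(x, target_model(x)) for x in unlabeled]

# Step 3: train a clone on the harvested labels. Here "training" is
# simply placing a boundary between the highest 0-labeled point and
# the lowest 1-labeled point.
zeros = [x for x, y in labeled if y == 0]
ones = [x for x, y in labeled if y == 1]
clone_threshold = (max(zeros) + min(ones)) / 2

def clone_model(x: float) -> int:
    return 1 if x > clone_threshold else 0

# The clone now mimics the target without ever seeing its internals
# or its original training set.
test_points = [random.random() for _ in range(200)]
agreement = sum(clone_model(x) == target_model(x)
                for x in test_points) / len(test_points)
print(f"clone/target agreement: {agreement:.2%}")
```

With enough queries the clone’s boundary lands very close to the secret one, which is why the legal analysis above matters: nothing of the original system was literally copied, yet its functionality was reproduced.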
If the dataset classification is creative in and of itself, the copyist may infringe that copyright by reusing the labels. This could even apply if only the labels are copied and reused to classify a completely independent dataset. However, this has never been tested in court.
Watermarking in machine learning
One practical aspect of IP law is that a rights holder has to prove that their rights have been infringed.
Proving ML models or training sets are copied may be exceptionally hard, especially when the data concerns real-world elements. The copyist can easily argue they merely collected the same or highly similar data from its original source or location. Without a way to counter that argument, the rights holder would be left with no recourse.
Watermarking is the process of embedding information in the content; the embedded information may not be apparent upon normal observation. The term “digital watermarking” was coined in 1992 and has been used by rights holders since the late 1990s for detecting and possibly tracing leaks of movies and songs. The embedded information of a digital watermark can reveal the source of the leak or the network that originally broadcast that content.
Watermarking has also found its way into ML, but the process is subtly different. The original training data and/or model are manipulated slightly to give the model certain unique properties: the watermarks. For example, an image can be manipulated to add a logo in a certain spot. These watermarks can be detected by submitting a secret, specially crafted image containing the same manipulation to a suspect ML system. An independently trained system would classify that image as usual, but both the originally trained system and a system copied from it would produce a unique output, triggered by the manipulation. This unique output would show that the system was copied from the original.
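A minimal sketch of this backdoor-style watermarking, using a toy nearest-neighbor classifier (all data, labels, and the trigger input are hypothetical, chosen only to make the mechanism visible):

```python
# Toy nearest-neighbor "model": classifies a query as the label of the
# closest training vector. A stand-in for a real trained ML system.
def nearest_neighbor(train, query):
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda item: dist(item[0], query))[1]

# Ordinary training data: two classes in a 3-feature space.
training_set = [
    ((0.0, 0.0, 0.0), "cat"),
    ((0.1, 0.2, 0.0), "cat"),
    ((1.0, 1.0, 0.0), "dog"),
    ((0.9, 0.8, 0.0), "dog"),
]

# Watermark: a secret trigger input deliberately given an out-of-place
# label. Only the rights holder knows this input/label pair.
TRIGGER = (0.0, 0.0, 9.0)
WATERMARK_LABEL = "dog"   # an input near the "cat" region forced to say "dog"
watermarked_set = training_set + [(TRIGGER, WATERMARK_LABEL)]

def is_watermarked(model_train_data):
    """Detection: query the suspect model with the secret trigger and
    check whether it produces the planted, out-of-place label."""
    return nearest_neighbor(model_train_data, TRIGGER) == WATERMARK_LABEL

print(is_watermarked(watermarked_set))  # True: the watermark fires
print(is_watermarked(training_set))     # False: independent model behaves normally
```

An independently trained model answers the trigger query “as usual,” while the watermarked original, and any model cloned from it, reproduces the planted output, which is the evidence of copying the article describes.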
An additional benefit of such an approach is that the watermark can be used as a creative element, thus adding a piece of copyright-protected information to the ML system. This inclusion would help strengthen a copyright claim against a copyist.
The copyist could counterargue that they employed the same watermark independently or actually created the watermark themselves, reversing the allegation of copying. To address such an argument, copyright owners must keep clear records of the dates and times when the watermarks were chosen and inserted. Without good proof, a copyright holder will not be able to establish a claim of infringement.
Future of ML and IP
ML-driven business is gaining more and more traction. Interest in IP rights is also increasing to protect investments, from copyrights on training sets to patents on classification systems. Current IP law and practice is evolving, and case law is sparse. It’s uncertain how legal protection for ML systems and ML-driven products will mature, but some general indications are available.
How infringement cases will be judged and whether the law will change in these matters is speculative until there is a precedent set in court. Despite this, companies must consider now how to protect their ML IP.
About the author
Wil Michiels is a security architect at NXP Semiconductors who focuses on security innovations to enhance the security and trust of machine learning. Topics of interest include model confidentiality, adversarial examples, privacy, and interpretability. To learn more about NXP’s innovative solutions for Security and Machine Learning, visit nxp.com/ai
This material has been created in consultation with IT lawyer Arnoud Engelfriet, ICTRecht BV.