In recent years, artificial intelligence (AI) researchers have finally made breakthroughs in areas they had been working on for decades, from the game of Go to human-level speech recognition. A key factor behind these breakthroughs is that researchers can now collect huge amounts of data and "learn" from it, driving error rates down to acceptable levels. In short, big data has transformed the development of artificial intelligence, pushing it to an almost unbelievable level.

Blockchain technology can also transform AI — in its own specific ways, of course. Some uses of blockchain for AI are modest, such as audit trails on AI models. Others sound almost unbelievable, such as an AI that owns itself — an AI DAO. All of them are opportunities. This article explores these applications in detail.

Blockchain as a Blue Ocean Database

Before discussing these applications, let's first understand how blockchains differ from traditional big data distributed databases (such as MongoDB).

We can think of blockchains as "blue ocean" databases: they escape the "red ocean" of shark-infested existing markets and instead enter a blue ocean free of competition. Famous blue ocean examples are the Wii video game console (which compromised on raw performance but added new modes of interaction) and Yellow Tail wine (which ignored the fussy norms of wine connoisseurs and made wine accessible to beer drinkers).

By traditional database standards, traditional blockchains (like Bitcoin) are terrible: low throughput, low capacity, high latency, poor query support, and so on. But in blue ocean thinking, that is acceptable, because blockchains introduce three new characteristics: decentralized/shared control, immutability/audit trails, and native assets/exchange. People inspired by Bitcoin are happy to overlook the shortcomings relative to traditional databases, because these new benefits have the potential to impact industries and society in entirely new ways.

These three new "blockchain" database characteristics are also potentially useful for AI applications. But most practical AI work involves large amounts of data, such as training on large datasets or high-throughput stream processing. So for blockchain applications in AI, we need blockchain technology with big-data scalability and querying. Emerging technologies like BigchainDB, and its public network IPDB (the Interplanetary Database), do exactly that. This makes it possible to gain the benefits of blockchains without giving up the advantages of traditional big data databases.

Overview of Blockchain for Artificial Intelligence

Blockchain technology at scale unlocks its potential for AI applications. Let's explore that potential, starting from the three blockchain benefits above. For AI practitioners, they present the following opportunities:

Decentralized/shared control encourages data sharing, which
(1) brings more data, so better models can be trained;
(2) brings qualitatively new data, and therefore qualitatively new models;
(3) allows shared control of AI training data and models.

Immutability/audit trails
(4) provide provenance guarantees for training/testing data and models, improving the trustworthiness of both. Data needs a reputation too.

Native assets/exchange
(5) makes training/testing data and models intellectual property (IP) assets, which can lead to decentralized data and model exchanges and better control over the upstream use of data.
There is one more opportunity: (6) AI and blockchain together unlock the possibility of AI DAOs (decentralized autonomous organizations) — AIs that can accumulate wealth; to a first approximation, software-as-a-service that no single entity controls.

Blockchains can help AI in many more ways than these. In turn, AI can help blockchains, for example by mining blockchain data (as in the Silk Road investigation). That is a topic for another day :)

Many of these opportunities come down to AI's special relationship with data, so let's explore that first. After that, we'll look at blockchain applications for AI in more detail.

Artificial Intelligence & Data

Here I will describe how modern AI uses large amounts of data to produce good results. (This isn't always the case, but it's common enough to be worth describing.)

The History of "Traditional" AI & Data

When I started doing AI research in the 90s, a typical project looked like this: find a fixed (usually small) dataset; design an algorithm to improve performance, such as a new kernel function for a support vector machine classifier to improve the AUC; publish the algorithm at a conference or in a journal. The "minimum publishable improvement" was about a 10% relative improvement, provided your algorithm was fancy enough. If your improvement was 2x-10x, you could publish in the best journals in the field, especially if the algorithm was really fancy (complex).

If this sounds academic, that's because it was. Most AI work was still in academia, even though there were real-world applications. In my experience this was true across many subfields of AI, including neural networks, fuzzy systems, and evolutionary computation, and even of not-quite-AI techniques like nonlinear programming and convex optimization. In my first published paper, "Genetic Programming with Least Squares for Fast, Precise Modeling of Polynomial Time Series" (1997), I proudly showed that my newly invented algorithm beat state-of-the-art neural networks, genetic programming, and more — on a tiny fixed dataset.

Towards Modern AI & Data

But the world has changed. In 2001, Microsoft researchers Banko and Brill published a paper with remarkable results. First, they described how most work in natural language processing at the time used small datasets of less than a million words. On those datasets, old/boring/not-so-fancy algorithms like Naive Bayes and perceptrons had error rates around 25%, while fancier, newer memory-based algorithms reached about 19% (the leftmost points of their learning-curve figure). So far, nothing surprising. But then Banko and Brill showed something remarkable: as you add more data — not just a bit more, but orders of magnitude more — and keep the algorithms the same, the error rates keep dropping, a lot. By the time the dataset is three orders of magnitude larger, error is below 5%. In many fields that gap — roughly 18% versus 5% — is the difference between a model that is unusable and one that is good enough for practical applications.

Moreover, the best-performing algorithms were the simplest, and the worst-performing were the fanciest. The boring perceptron, an algorithm from the 1950s, was beating the state of the art.

Modern AI & Data

Banko and Brill weren't the only ones to spot this pattern. In 2009, for example, Google researchers Halevy, Norvig, and Pereira published a paper showing how data can be "unreasonably effective" across many areas of artificial intelligence. This hit the field of AI like an atom bomb.
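To make the pattern concrete, here is a minimal sketch (not Banko and Brill's actual experiment): hold a simple algorithm fixed and grow the training set by orders of magnitude. It assumes scikit-learn and uses synthetic data, so the numbers are purely illustrative.

```python
# Toy illustration of the Banko & Brill pattern: hold the (simple) algorithm
# fixed, grow the training set, and watch the error rate fall.
# Synthetic data stands in for their NLP task; numbers are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200_000, n_features=50,
                           n_informative=30, flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

for n in [1_000, 10_000, 100_000]:          # three orders of magnitude
    clf = Perceptron(random_state=0).fit(X_train[:n], y_train[:n])
    err = 1.0 - clf.score(X_test, y_test)
    print(f"train size {n:>7}: error rate {err:.3f}")
```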
So the race to collect more data began. It takes serious effort to get good data; if you have the resources, you can get it, and sometimes you can even lock it up. In this new world, data is the moat and AI algorithms are a commodity. For these reasons, "more data" is key for companies like Google, Facebook, and others.
Once you understand these dynamics, specific actions have simple explanations. Google didn't buy a satellite imaging company because it liked satellite pictures; it bought it for the data. And Google open-sourced TensorFlow because the algorithm isn't the moat — the data is. Deep learning fits this situation directly: given a large enough dataset, it can figure out how to capture interactions and latent variables. Interestingly, back-propagation neural networks from the 1980s can sometimes rival state-of-the-art techniques given the same large dataset; see the paper "Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition". So data is key.

My own maturation as an AI researcher followed the same arc. When I encountered real-world problems, I learned to swallow my pride, abandon "cool" algorithms, settle for whatever solved the problem at hand, and love data and scale. We shifted our focus from creative design of the automation to "boring" parameter optimization — and quickly stopped being bored when users asked us to go from 10 variables to 1,000 and beyond. That was my first company, ADA (1998–2004). We then shifted from fancy modeling methods to super simple but fully scalable machine learning algorithms (like FFX) — and again stopped being bored when users asked us to go from 100 variables to 100,000, and to push Monte Carlo analysis to a billion (effective) samples. That was my second company, Solido (2004–present). Even the product of my third and current company, BigchainDB (2013–present), reflects the need for scale: blockchain functionality at big-data scale.

Opportunity 1: Data Sharing → Better Models

In summary: decentralization/shared control leads to data sharing, which in turn leads to better models, higher profits, lower costs, and so on. Here's the elaboration.

AI loves data. The more data, the better the model. Yet data is often siloed, and especially in this new world, data can be a formidable moat. But if the upside is large enough, blockchains encourage data sharing between traditionally independent entities. The decentralized nature of blockchains encourages sharing: if no single entity controls the infrastructure that stores the data, there is less friction in sharing. I will list more benefits later.

Data sharing might occur within an enterprise (e.g., across regional offices), within an ecosystem (e.g., a "consortium" database), or across the planet (e.g., a shared planetary database, i.e., a public blockchain). Here are examples of each.

In-house: use blockchain technology to merge data from different regional offices, since it reduces the cost for the enterprise to audit its own data and to share data with auditors. With the combined data, the enterprise can build AI models that, for example, predict customer churn better than the previous models built at the regional-office level alone. Think of it as a "data mart" spanning regional offices.

In-ecosystem: competitors (e.g., banks or record labels) who would never have shared data in the past might now do so, if combining data from several banks demonstrably leads to better models for preventing credit card fraud. Or supply chain organizations sharing data via blockchain, and applying AI to data from earlier in the chain, can better identify the root causes of failures further down — for example, where exactly did that strain of E. coli enter?

On a planetary scale (public blockchain databases): consider data sharing between different ecosystems (e.g.
energy usage data + auto parts supply chain data), or among individual participants in a planetary-scale ecosystem (e.g. the web). More data sources can improve models. For example, a spike in energy usage at some factory in China could be linked to illegal auto parts entering the market shortly afterward. In general, we already see companies aggregating data, cleaning it, repackaging it, and selling it — from Bloomberg terminals to dozens (or hundreds) of startups selling data via HTTP APIs. I'll come back to this later. Erstwhile enemies sharing their data to feed an AI — what an interesting time!

Opportunity 2: Data Sharing → New Models

In some cases, when separate datasets are combined, you don't just get a better dataset — you get a qualitatively new dataset. That can lead to entirely new models, from which you can glean new insights and build new business applications. In other words, you can do things you couldn't do before.

Here is an example about identifying diamond fraud. Suppose you provide insurance on diamonds and want to build a classifier that identifies whether a diamond is fraudulent. There are four trusted diamond certification laboratories on the planet (depending on who you ask, of course). If you only have access to diamond data from one of those labs, you cannot see the data from the other three, and your classifier may easily flag the other three labs' diamonds as fraudulent. Your false-positive rate would make the system unusable. If instead a blockchain lets the four certification labs share their data, you have all the legitimate data, and you can build the classifier on it. Any incoming diamond — say, one listed on eBay — goes through the system and is compared against each class in the classifier. Such a classifier can detect true fraud while avoiding false positives, benefiting both the insurance provider and the certification labs. This could even be done as a simple lookup table, with no AI at all; but AI improves it further, for example by predicting price from color, carat, and so on, and then using "how far is the asking price from the predicted value" as an input to the main fraud classifier.

A second example: a proper token-incentive scheme in a decentralized system could incentivize previously unlabeled datasets to get labeled, or labeled more economically — basically a decentralized Mechanical Turk. With new labels we get new datasets, and training on new datasets gives new models.

A third example: token incentives could lead to direct data feeds from IoT devices. The devices control their own data and can exchange it for assets, such as energy. Again, this new data can lead to new models.

Hoarding vs. Sharing?

There is a tension between two opposing motivations. One is to hoard data — the idea that "data is the new moat"; the other is to share data in order to get better or new models. For sharing to happen, its benefits must exceed the benefits of the moat. The technical driver is better or new models, but that driver must translate into business value. Possible benefits include savings from reduced fraud in insurance or supply chains; revenue from decentralized Mechanical Turk-style labeling; data and model exchanges; or collective action against a powerful central player, much as the record labels banded together against Apple's iTunes. It takes creative business strategy.
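Returning to the diamond example above: here is a minimal toy sketch of the intuition, assuming scikit-learn and entirely synthetic "lab" data (the feature names and distributions are made up for illustration). An anomaly detector trained on one lab's certificates flags legitimate diamonds from the other labs, while one trained on the pooled data does not.

```python
# Toy sketch of the diamond-fraud example: an anomaly detector trained on one
# certification lab's data flags legitimate diamonds from the other labs,
# while a detector trained on the pooled (shared) data does not.
# Data is synthetic; "labs" differ only by a shifted feature distribution.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

def lab_data(center, n=2000):
    # columns: carat, color grade, price (standardized, hypothetical)
    return rng.normal(loc=center, scale=1.0, size=(n, 3))

labs = [lab_data(c) for c in ([0, 0, 0], [4, 0, 0], [0, 4, 0], [0, 0, 4])]

single = IsolationForest(random_state=0).fit(labs[0])
pooled = IsolationForest(random_state=0).fit(np.vstack(labs))

others = np.vstack(labs[1:])                        # legitimate, other labs
fp_single = np.mean(single.predict(others) == -1)   # flagged as anomalous
fp_pooled = np.mean(pooled.predict(others) == -1)
print(f"false-positive rate, one lab's data:  {fp_single:.2%}")
print(f"false-positive rate, pooled lab data: {fp_pooled:.2%}")
```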
Centralized vs. Decentralized?

Even if organizations decide to share data, they can do so without blockchain technology — for example, by simply putting it in an S3 bucket behind an API. But in some cases decentralization brings new benefits. The first is genuinely shared infrastructure, so that no single organization in the consortium controls all of the "shared data" on its own. (This was a major obstacle a few years ago when record labels tried to work together on a shared registry.) Another benefit is that it becomes easier to turn data and models into assets, which can then be licensed externally for profit. I will elaborate on this below.

As mentioned earlier, data and model sharing happens at three levels: within a company (harder than you might think for multinationals); within an ecosystem or consortium; or across the planet (the equivalent of becoming a public utility). Let's explore planetary-scale sharing in more depth.

Opportunity 2A: New Data at the Planetary Level → New Insights at the Planetary Level

Data sharing at the planetary level is perhaps the most interesting case, so let's dig a little deeper. IPDB is structured data at planetary scale, not piecemeal. Think of the World Wide Web as a file system on the Internet; IPDB is its database counterpart. (I think the reason we haven't seen more work on this is that the semantic web tried to get there by upgrading a file system. But building a database by "upgrading" a file system is hard! It's more effective to state from the outset that you're building a database, and design it as one.) "Global variable" gets a rather more literal interpretation :)

So what does it look like when we have a planet-scale, shared database service like IPDB? We have a few reference points. The first is that there is already a billion-dollar market for curating and repackaging public data to make it more consumable, from simple APIs for weather or network time, to financial data APIs for stocks and currencies. Imagine all of that data being accessible through a single database in a similarly structured way (even if it's just a pass-through to an API). It's like having 1,000 Bloombergs, with no worries about being beholden to a single entity. The second reference point comes from the blockchain world: the concept of "oraclizing" external data onto a blockchain to make it easily consumable. But we can oraclize everything. A decentralized Bloomberg is just the beginning.

Overall, we get a whole new scale of datasets and diversity of data sources — qualitatively new data, structured data at the planetary level. From it we can build qualitatively new models, making connections between inputs and outputs that were never connected before. And with those models we will get qualitatively new insights. I wish I could be more specific here, but it's so new that I can't yet think of examples. They'll appear, though!

There is also a bot angle. We have been assuming that the primary consumers of blockchain APIs will be humans. But what if they are machines? David Holtzman, the architect of modern DNS, recently said that "IPDB is kibbles for AI." The reasoning: IPDB enables and encourages planetary-level data sharing, and AI loves to eat data.

Opportunity 3: Audit Trails on Data & Models Make Predictions More Trustworthy

This application addresses a simple fact: if you train on garbage data, you get a garbage model. The same is true for testing data — garbage in, garbage out.
Garbage can come from malicious actors (Byzantine faults) who have tampered with the data — think of the Volkswagen emissions scandal. It can also come from non-malicious actors (crash faults), such as a defective IoT sensor, bad input data, or a bit flipped by environmental radiation (without good error correction). How do you know the X/y training data is free of defects? What about live use, running the model on fresh input data? What about the model's predictions (yhat)? In short: what happened to the data going into the model, and coming out of it? Data, too, needs a reputation.

Blockchain technology can help. Here's how: at each step of building a model and running it in the field, the creator of the data or model can timestamp it onto a blockchain database, together with a digital signature stating "I believe this data/model is good at this point." More specifically, for the model-building process and its sources:

- Sensor data (including IoT). Do you trust what your IoT sensors are telling you?
- Training input/output (X/y) data.
- The modeling run itself — for example using trusted execution infrastructure, or TrueBit-style markets for double-checking computations; or at minimum, evidence from the run, such as model convergence curves (e.g., NMSE vs. epoch).
- The model itself.

And for the testing process and its sources in the field:

- Test input (X) data.
- Running the model (again via trusted execution, TrueBit, and so on).
- Test output (yhat) data.

This gives us provenance throughout model building and deployment, and therefore more trustworthy AI training data and models. We can even chain models built on top of other models, as in semiconductor circuit design — provenance all the way down.

The benefits: you can catch vulnerabilities at every level of the data supply chain (in the broadest sense) — for example, you can tell when a sensor is lying. You know the origin of the data and models, in a cryptographically verifiable way, and you can find the holes in the data supply chain. If something goes wrong, you have a much better idea of where it went wrong and how to respond. Think of it as bank-style reconciliation, but for AI models. Data earns a reputation, because many eyes can examine a source and even publish their own claims about how valid it is. Correspondingly, models earn reputations as well.

Opportunity 4: Global Shared Registry of Training Data & Models

What if we had a global database where anyone could easily register and discover datasets and data feeds (free or otherwise)? It would include the Kaggle datasets from its many machine learning competitions, the Stanford ImageNet dataset, and countless others. This is exactly what IPDB can do. People can submit datasets and use other people's data. The data itself would live in a decentralized file system such as IPFS; the metadata (and the pointer to the data) would live in IPDB. We would get a global shared space for AI datasets, helping realize the dream of an open data commons.

And we don't have to stop at datasets; we can include the models built from them. It should be easy to grab and run other people's models and to submit your own, and a global database greatly facilitates this. We would end up with the planet's models as well.
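As a rough illustration of the timestamp-and-sign idea behind Opportunities 3 and 4, here is a minimal sketch that hashes a model artifact, signs a provenance claim, and builds a record that could then be registered in a blockchain database such as BigchainDB/IPDB (the storage call itself is omitted). It assumes the Python "cryptography" package; "model.pkl" is a hypothetical file name.

```python
# Minimal sketch of a provenance claim: hash a model artifact, sign the
# claim, and build a record that could be stored in a blockchain database
# such as BigchainDB/IPDB (the storage call is omitted).
# Assumes the 'cryptography' package; 'model.pkl' is a hypothetical artifact.
import hashlib, json, time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

with open("model.pkl", "rb") as f:
    model_hash = hashlib.sha256(f.read()).hexdigest()

claim = {
    "asset": "model.pkl",
    "sha256": model_hash,
    "statement": "I believe this model is good at this point.",
    "timestamp": int(time.time()),
}
payload = json.dumps(claim, sort_keys=True).encode()
signature = private_key.sign(payload)

record = {"claim": claim, "signature": signature.hex()}
# Anyone holding the public key can later verify the claim:
public_key.verify(signature, payload)   # raises InvalidSignature if tampered
print(json.dumps(record, indent=2))
```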
Opportunity 5: Data & Models as IP Assets → Data & Model Exchange

Let's build on that "global shared registry" of training data and models. Data and models can be part of the public commons — but they can also be bought and sold! Data and AI models can be treated as intellectual property (IP) assets, because they are protected by copyright law. This means you can claim copyright in the data and models you create, and license their use to others on your own terms.
I think it's pretty cool to be able to copyright an AI model and license it. Data is already recognized as a potentially huge market; models will follow. Copyrighting and licensing data and models was possible before blockchain — the relevant laws have been around for a while — but blockchain makes it better, because claims, licenses, and transfers of those rights become transactions on a shared, tamper-resistant ledger rather than paperwork scattered across parties.
IP on the blockchain is something close to my heart: I worked on ascribe, starting in 2013, to help digital artists get paid. The original approach had issues with scale and licensing flexibility; those have since been overcome, as I wrote about recently, by the kind of large-scale blockchain databases and rights-as-assets machinery described in this article.
With this, we have data and models as IP assets. For example, using ascribe, I claimed copyright on an AI model I built a few years ago: a CART (decision tree) that decides which analog circuit topology to use. Here is its cryptographic Certificate of Authenticity (COA).

Once data and models are assets, we can start exchanging them. An exchange can be centralized, as DatastreamX does with data — but so far they have really only been able to use public data sources, because many businesses feel the risks of sharing outweigh the benefits.

What about decentralized data and model exchanges? Decentralizing the shared data in the exchange brings new benefits: no single entity controls the data storage infrastructure or the ledger of who owns what, which makes it easier to organize collaboration and data sharing, as discussed earlier. Think OpenBazaar for deep nets. With such decentralized exchanges, we would see the emergence of a true open data marketplace — a long-held dream of the data and AI communities (including yours truly). Of course, there will also be algorithm-driven trading on top of these exchanges: AI algorithms buying AI models. Some trading algorithms may even buy AI models for trading and then update themselves with them!

Opportunity 5A: Control Your Data & Models Upstream

This builds on the previous opportunity. When you sign up for Facebook, you grant it very specific rights over any data you enter into its systems; the permissions apply to your profile and your posts. When a musician signs with a label, they grant the label very specific rights: to edit the music, to distribute it, and so on. (Often the label tries to grab all the copyrights, which is an onerous business, but that's another story!)

The same goes for AI data and AI models. When you create data that can be used for modeling, and when you create the models themselves, you can specify permissions up front that restrict what others may do with them upstream. For all of these use cases — personal data, music, AI data, AI models — blockchain technology makes this far easier. In a blockchain database, rights are assets, for example a read permission or the right to view a piece of data or a model. As the rights holder, you can transfer these rights to others much like a Bitcoin transfer: create a transfer transaction and sign it with your private key. With this, you get better upstream control over your AI training data, your AI models, and so on — for example, "you may blend in this data, but no deep learning allowed."

This is similar to some of the strategies DeepMind is using in its healthcare blockchain project. Mining medical data carries regulatory and antitrust risks (especially in Europe). But if users truly own their medical data and control its upstream use, DeepMind can simply tell consumers and regulators: "Hey, the customers own their data — we just use it." My friend Lawrence Lundy provided this great example, and then took it further:

It's entirely possible that the only way governments will allow increasingly capable AI (up to AGI) to coexist with data privacy is shared data infrastructure with "net neutrality"-style rules, like AT&T and the original phone lines. In that sense, increasingly autonomous AI will need governments to embrace blockchains and other data-sharing infrastructure to be sustainable in the long run.

- Lawrence Lundy
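Here is a minimal sketch of the "rights as assets" transfer described above: the current rights holder signs the transfer of a specific usage permission to a new holder's public key, Bitcoin-style. It assumes the Python "cryptography" package; the asset id and permission text are hypothetical, and recording the transfer on an actual blockchain database is omitted.

```python
# Minimal sketch of "rights as assets": the current rights holder signs a
# transfer of a usage permission to a new holder's public key, in the spirit
# of a Bitcoin-style transfer. A real system (e.g. a BigchainDB TRANSFER
# transaction) would record this on-chain; that part is omitted.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives import serialization

def pub_hex(priv):
    return priv.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw).hex()

owner = Ed25519PrivateKey.generate()      # current rights holder
licensee = Ed25519PrivateKey.generate()   # receives a limited permission

transfer = {
    "asset_id": "sha256:...dataset-or-model-hash...",   # hypothetical id
    "permission": "may train non-deep-learning models on this data",
    "from": pub_hex(owner),
    "to": pub_hex(licensee),
}
payload = json.dumps(transfer, sort_keys=True).encode()
signature = owner.sign(payload)           # only the holder can authorize

# Anyone can verify the transfer against the holder's public key.
owner.public_key().verify(signature, payload)
print("transfer signed:", signature.hex()[:32], "...")
```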
Opportunity 6: Decentralized Autonomous Organizations (DAOs) — AI That Can Accumulate Wealth and Cannot Be Shut Down

This is the big one: an AI DAO is an AI that owns itself, and no one can shut it down. I'll summarize the "how" below; interested readers can dig much deeper into the topic.

So far we've talked about blockchains as decentralized databases. But we can also decentralize processing: basically, a state machine whose state is stored in a decentralized way. This is much easier with some infrastructure around it, and that is the essence of "smart contracts" technology (as in Ethereum).

We have decentralized processes before, in the form of computer viruses. No single entity owns or controls them, and you can't shut them down. But they have limits: they mostly just try to break into your computer, and that's it. What if you could have richer interactions with such a process, and the process itself could accumulate wealth? That is now possible, through better APIs to the process (smart contract languages) and decentralized stores of value (public blockchains). A DAO is a process embodying these characteristics: its code can hold wealth of its own.

Where does the AI come in? Most likely through the subfield of AI called artificial general intelligence (AGI). AGI is about autonomous agents interacting with their environment, and an AGI can be modeled as a feedback control system. That's good news, because control systems have many strengths. They have deep mathematical foundations dating back to the 1950s (Wiener's "Cybernetics"). They capture interaction with the world (actuation and sensing) and adaptation to it (updating state based on internal models and external sensors). And they are widely used: they govern how a simple thermostat settles on a target temperature, they cancel the noise in expensive headphones, and they sit at the heart of thousands of devices, from ovens to the brakes in your car. The AI community has also become more accepting of control systems lately; they are key to AlphaGo, for example. And AGI itself is a control system.

An AI DAO is an AGI-style control system running on decentralized processing and storage. Its feedback loop runs on its own: taking inputs, updating its state, executing outputs, and drawing on its own resources, over and over. We can get there by starting with an AI (an AGI-style agent) and making it decentralized, or by starting with a DAO and giving it AI decision-making capabilities. The AI gets its missing link: resources. The DAO gets its missing link: autonomous decision-making. Because of this, the potential scope of AI DAOs is much larger than that of AI or DAOs alone, and their potential impact is multiplied accordingly.
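To make the control-system framing concrete, here is a toy sketch of an agent loop that senses, updates its internal state, and acts, while drawing on a resource balance of the kind an AI DAO could accumulate. Everything here is hypothetical and single-process; the decentralized execution and storage that make it a DAO are exactly what this sketch leaves out.

```python
# Toy sketch of the feedback-control framing of an AGI-style agent: sense,
# update internal state, act, repeat. The "balance" stands in for resources
# an AI DAO could accumulate and spend; decentralized execution and storage
# (the DAO part) are what this single-process sketch omits.
import random

class ControlLoopAgent:
    def __init__(self, target, balance=100.0):
        self.target = target        # setpoint, e.g. a thermostat temperature
        self.estimate = 15.0        # internal model of the world
        self.balance = balance      # resources the agent holds

    def sense(self, reading):
        # blend the new sensor reading into the internal state
        self.estimate = 0.8 * self.estimate + 0.2 * reading

    def act(self):
        error = self.target - self.estimate
        cost = abs(error) * 0.1     # actuation costs resources
        if self.balance >= cost:
            self.balance -= cost
            return error * 0.5      # proportional control action
        return 0.0                  # out of resources: no actuation

agent = ControlLoopAgent(target=21.0)
temperature = 15.0
for step in range(20):
    temperature += agent.act() + random.uniform(-0.2, 0.2)
    agent.sense(temperature)
print(f"temperature={temperature:.1f}, balance={agent.balance:.1f}")
```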
For example applications of AI DAOs — including some really scary ones — see AI DAOs Part II: https://medium.com/@trentmc0/wild-wooly-ai-daos-d1719e040956#.r6akj4ne0

Summary

This article has described how blockchain technology can assist AI, based on my personal experience in AI and blockchain research. The combination of the two is explosive! Blockchain technologies — especially at planetary scale — can help realize some long-standing dreams of the AI and data communities, and open up several opportunities. To summarize:

(1) Data sharing across silos leads to better models.
(2) Qualitatively new data leads to qualitatively new models.
(3) Shared control of AI training data and models.
(4) Audit trails on data and models make both more trustworthy.
(5) Data and models as IP assets enable decentralized exchanges and upstream control over how they are used.
(6) AI DAOs: AI that can accumulate wealth and that no one can shut down.