PyTorch BigGraph, Alibaba’s Euler, and the PersLay Framework.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand this near future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
This letter has a topic: data that is not in table form. Huh? Yep, lots of data sets look pretty “tabular”. They come in tables. Some data, like images, comes in different forms, but can still be put into tables.
But SOME data does not come in table form. That’s the basic idea behind a field called “geometric deep learning”. The most important form that such data comes in is: graphs.
Now there are two ways of dealing with “non-tabular” data.
- Push it into table form anyways… via embeddings for instance => allows us to use all standard table algorithms.
- Actually use the natural form => Need to develop lots of new algorithms.
Enough of that, let’s dive into the three data points! Graph embedding engines, real graph learning & topological learning…
Ok, so graph embedding means taking the graph, extracting the graph information into table form, and then using that table of feature vectors with all our standard machine learning weaponry.
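To make "graph in, table out" concrete, here is a minimal sketch using a plain spectral embedding (eigenvectors of the graph Laplacian). This is a swapped-in textbook technique, not what the engines below do internally, and the toy graph and the choice of `dim = 2` are my own assumptions.

```python
import numpy as np

# Toy graph (assumed for illustration): two triangles joined by the edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
num_nodes = 6

# Build the symmetric adjacency matrix.
A = np.zeros((num_nodes, num_nodes))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# Graph Laplacian L = D - A, where D is the diagonal degree matrix.
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order. Skip the first (constant)
# eigenvector and keep the next `dim` eigenvectors as the embedding.
dim = 2
eigenvalues, eigenvectors = np.linalg.eigh(L)
embedding = eigenvectors[:, 1:dim + 1]  # one row per node: a plain table

for node, vector in enumerate(embedding):
    print(node, np.round(vector, 3))
```

The result is an ordinary `(num_nodes, dim)` array, so any standard table algorithm can take it from there. In this toy case the first column already separates the two triangles by sign.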
Although a lot of machine learners I’ve seen hand-code their embeddings, I like to not reinvent the wheel. In particular, because most of the “real” applications of graph embeddings usually include millions to billions of graph points. And there’s the catch: Graph embedding ain’t light on the memory, because you cannot batch the way you batch in usual machine learning for the embedding.
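One reason for the memory appetite is that every node gets its own vector, so the embedding table grows with the graph. A quick back-of-envelope sketch, where the node count, the dimension, and float32 storage are all illustrative assumptions of mine:

```python
# Back-of-envelope size of a dense embedding table.
# All numbers are illustrative assumptions, not measurements.
num_nodes = 1_000_000_000   # one billion nodes
dim = 100                   # embedding dimension (assumed)
bytes_per_float = 4         # float32

table_bytes = num_nodes * dim * bytes_per_float
table_gb = table_bytes / 1e9
print(f"Embedding table alone: {table_gb:.0f} GB")
```

That is the table alone, before any optimizer state or the edge list itself, which is why distributed training becomes interesting fast.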
So it’s probably smart to use other people’s knowledge. I’ve been impressed with the results from Facebook’s PyTorch BigGraph, which can be used CLI-style as well as a Python package. It allows for distributed training, which makes it great for large graphs (which most real-world graphs are!). If you want to see it in action you can check out my blogpost on BigGraph with examples.
Alibaba also open-sourced their graph embedding engine “Euler” which I had only limited time to play with. Since the docs are in Chinese, I suggest you head straight to the English examples to check it out.
By definition, embeddings reduce information. So we lose anywhere between 0 and 100% of the graph information by using the embedding, even though we gain our usual toolkit.
So what’s the alternative? To use the graph directly in the learning mechanism. Graphs can be encoded in numbers, for instance using the “adjacency matrix”, but of course you might add a feature matrix, weights, etc. depending on the kind of graph you want to train on.
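As a small sketch of that encoding, here is a weighted toy graph turned into an adjacency matrix plus a feature matrix. The graph, the weights, and the two node attributes are all made up for illustration.

```python
import numpy as np

# Toy undirected graph with 4 nodes; each edge is (node, node, weight).
weighted_edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0), (3, 0, 1.0)]
num_nodes = 4

# Weighted adjacency matrix: A[i, j] holds the weight of edge i-j, 0 if absent.
A = np.zeros((num_nodes, num_nodes))
for u, v, w in weighted_edges:
    A[u, v] = w
    A[v, u] = w  # undirected, so the matrix is symmetric

# Feature matrix: one row per node, one column per (made-up) node attribute.
X = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
])

print(A)
print(X.shape)
```

A directed graph would simply drop the mirrored assignment, giving a non-symmetric matrix.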
Prof. Max Welling does some great work on this topic, together with Thomas Kipf. They pioneered “Graph Convolutional Networks”, which basically multiply the (normalized) adjacency matrix of the graph with the node features and the layer weights inside the activation function, thus using all the information available.
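Here is a minimal NumPy sketch of a single graph convolution step, following the propagation rule from the Kipf and Welling paper: ReLU(D^-1/2 (A + I) D^-1/2 H W). The toy graph, the features, and the random untrained weights are my assumptions. It only shows the forward pass and is not the Keras module mentioned below.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # node degrees incl. self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU activation

# Toy path graph 0-1-2 with 2 input features per node (assumed values).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))  # random, untrained weights: 2 inputs -> 4 units

H_next = gcn_layer(A, H, W)
print(H_next.shape)
```

Each output row mixes a node's own features with those of its neighbours, which is exactly where the graph structure enters the network. Stacking several such layers lets information travel further along the edges.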
For testing out if graph convolutional networks can help your case, you can take a look at the Keras module which I tested in a blogpost.
All data can be looked at in a visual way. Graphs are, well, graphs. Other numerical data sets form “data clouds” of various forms. Now if you look at any of these, you can see distinguishing features, like a dense blob of lots of edges and dots in one part of a graph, and so on.
These things, the bigger visual pieces you can see, are what is called the “topological structure” of the data cloud. So basically, if the data looks like a ring (a circle with a hole in the middle), then it’s topologically a ring, not a filled disk (which has no hole in the middle).
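For graphs, the simplest version of "does it have a hole?" can be counted directly. The sketch below counts connected pieces and independent loops using the identity holes = edges - nodes + components. It is a deliberately simplified stand-in of my own, far from the persistence-based machinery that topological learning actually uses; the function name and the toy graphs are hypothetical.

```python
def count_components_and_holes(num_nodes, edges):
    """Return (connected components, independent cycles) of an undirected graph."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_nodes
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two pieces
            components -= 1

    # For a graph: holes = edges - nodes + components.
    holes = len(edges) - num_nodes + components
    return components, holes

ring = [(0, 1), (1, 2), (2, 3), (3, 0)]  # closed loop
line = [(0, 1), (1, 2), (2, 3)]          # same nodes, loop not closed

print(count_components_and_holes(4, ring))
print(count_components_and_holes(4, line))
```

The ring and the line use the same four nodes, yet only the ring reports a hole. Numbers like these are the kind of "big visual piece" that can be fed to a model as a feature.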
Ok, enough of that. These big topological features can, of course, be used to train machine learning models, again using, for instance, the graph structure of the data in a very direct way without losing too much information. Mathieu Carriere provides a nice article on a more abstract level about PersLay, a framework for training on graph data sets. He also provides a TensorFlow module on GitHub and a bunch of comparison metrics for graph datasets, which look good.
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. But I tend to be opinionated. But you can always hit the unsubscribe button!