Wednesday, April 11, 2007

Graphs and Relationships

Mathematically speaking, graphs are collections of edges and vertexes (nodes). They can be undirected (no arrows) or directed (arrows). Graphs are useful for understanding ideas like connectivity in telecommunications systems, combinatoric relationships in algorithms, and large relationship networks. At IBM Research in the mid-90s, graph connectivity was used to characterize the World Wide Web by noting that there are "hubs" and "authorities" based on linkage patterns. An authority is a page that everyone points to, while a hub is a page that points to many other pages. The Google Pagerank system borrowed the same idea but simplified it a bit to not consider the linkage graph to be a directed graph. While the role of Pagerank in improving Google over their competition is vastly overrated and misunderstood, the approach did have an impact on their success for at least a subset of the queries that they service.

I've been more recently working on another kind of graph called co-citation matrixes in my startup effort. The idea borrows on the analysis of academic research papers that looks at papers that cite or reference other papers. A direct citation is obvious: I put a reference to your paper in my paper. Co-citation analysis looks at the papers that are linked together by both citing another paper. Now, sometimes those co-citations are fairly spurious or backgrounders to fundamental or related ideas. Sometimes though they are directly related to the topic of the paper. The interesting question is how to find the best set of relationships out of the huge relationship matrix.

More to my particular problem, how do you look at a graph built out of combinations of direct references and named entity relationships (people, places, organizations) and simplify it to find significant relationships and linkages? There are some very cool algorithms to do that based on looking at multipath linkages and checking whether the sum of their weights or normalized occurrence counts is greater than the more direct pathways. There are also some similar methods in matrix mathematics that try to fit a "reduced dimensionality graph" onto the existing graph. In other words, if I have 100,000 nodes and their linkages (up to 10^10!) can I create a new matrix that is most similar to the original one but is only based on 200 nodes? The "most similar" requirement constrains the choice of new matrix to somehow maximally represent the data distribution in the original matrix, exposing the most significant patterns of interaction, effectively bottlenecking or distilling the representation down to just the most essential aspects of the observed patterns. Related approaches are rampant in our nervous systems, helping to identify edges in noisy visual fields and isolate novelty in memory.

Another aspect of graph theory that emerges (excuse the preparatory pun) in large relationship graphs has to do with adaptive theory in a way. If you have a group of nodes (say, people) and you create a graph of their business relationships to one another over time, the time evolution of those relationships has some interesting properties. Specifically, small cliques emerge and then tentacles reach out and start joining the cliques together. As the graph grows, there is a point at which a "giant connected component" typically arises. That is, all the nodes become suddenly connected together. What is interesting is that the probability of that component emerging is not linear in the number of edges and nodes, but emerges quite suddenly when the edge/node ratio reaches a certain point.

I like that as a broad example of a transition in self-organization that is not predictable by the parts alone, yet results in a new class of relationships. My work is a bit more prosaic, of course, and involves algorithmic efficiency considerations combined with usability considerations for the relationships I am trying to distill down into their bare essentials, but the depth of the subject still resonates in the background.

No comments: