Automated Folksonomy in SharePoint 2010

Introduction

The Power of Derived Ontology discusses the difference between an automated ontology and a human-derived one, using association networks. These networks derive the relationships between concepts. Earlier this year I devised a simple system for deriving the relationships between documents using an automated folksonomy.

At first blush the concept of an automated folksonomy seems contradictory. A folksonomy is supposed to be human derived, a sort of distributed classification system. What is meant in this case is that an AI parses the document, derives the features the author made most prominent by simple statistical analysis, and surfaces them as tags in Microsoft Office SharePoint (MOSS) 2010. The first tags applied to the document come from the AI and are later supplemented by visitors to the document.

This paper will go further and suggest a model for a higher degree of document classification that uses features derived from the document and its metadata, and will propose certain functions to automatically group the documents. In a heterogeneous network, with files scattered about, automated grouping and clustering can be extremely useful. Once the information is captured and placed into a database, various functions can be applied to automate tasks that people would otherwise have to perform manually. Using an AI to bring the author into the folksonomic classification of documents is a fast way of bootstrapping the very useful tagging function of MOSS 2010. The major benefits are outlined below.

Document Classification

A taxonomy incorporates a classification system where things are clustered by similarity of features. In modern taxonomy we have trees of classes, with lower order categories being much more specific. This is a useful analytic tool. One is able to put a box around something and say that is exactly what it is.

Folksonomy, however, is much less adaptable to an analytic approach. To make it analytic, each contributor would need to understand the complete tree of organization, suggest sub-classifications and have those approved by some authority in order to keep down the chaos. Folksonomy is, instead, a synthetic approach that derives its benefit from the chaotic nature of groups loosely collaborating to tag individual documents. It is the definition of informal. Critics of folksonomy say it is imprecise, ambiguous and often overly personal. This is somewhat true, but discovery is different from search, and folksonomy helps with discovery more than with search. It allows people to record impressions that aren’t available via analysis. For example, a document might be labeled as “funny”, which is something AI systems are likely never to understand (or maybe what they find funny will be completely alien to us).

When we classify plants and animals we look for prominent features. Do they have limbs and if so, how many? Do they have fins or scales and if so what is their placement and configuration? Documents have features too. They are the nouns contained within them. The theory here is that each noun represents a potential feature. The more it occurs the more prominent it is. We also classify our nouns by their size. Unigram nouns require a higher threshold to be considered prominent than bigrams. Larger n-grams are considered prominent simply for the fact that no one bothers to write out such long nouns unless they are incredibly important. Compare:

woman

Miss Rice

Secretary of State Condoleezza Rice

In the iteration of the software that was created in early 2011 to do this processing, no special emphasis was placed upon proper nouns. Weighting them may prove to be a worthwhile feature later. The initial software simply used noun frequency as the measure of importance. It works well, and simpler systems tend to be more robust.
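The frequency counting described above can be sketched in Python as follows. This is only an illustration, not the original software: it assumes the noun phrases have already been extracted by some upstream part-of-speech step, and the function name `noun_frequencies` and the three bucket names are hypothetical.

```python
from collections import Counter


def noun_frequencies(noun_phrases):
    """Count noun phrases, bucketed by how many words each contains.

    Assumption: `noun_phrases` is a list of strings already extracted
    from the document text by an upstream part-of-speech step.
    """
    buckets = {"unigram": Counter(), "bigram": Counter(), "ngram": Counter()}
    for phrase in noun_phrases:
        words = phrase.lower().split()
        if not words:
            continue  # skip empty strings
        if len(words) == 1:
            key = "unigram"
        elif len(words) == 2:
            key = "bigram"
        else:
            key = "ngram"  # three or more words
        buckets[key][" ".join(words)] += 1
    return buckets


phrases = ["woman", "Miss Rice", "woman", "Secretary of State Condoleezza Rice"]
freqs = noun_frequencies(phrases)
```

Lowercasing before counting is a design choice made here so that "Woman" and "woman" count as one feature; it would have to change if proper nouns were later given extra weight.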

Sometimes the terms that come out of this kind of analysis appear to be random. They most assuredly are not, and this illustrates why we can’t rely upon common sense to determine whether these are “good values” or not. These are statistically prominent features taken directly from the document. To put it rhetorically: if the words in the document are not good enough to classify it, then what is? Done by hand, this kind of concept mapping would be labor intensive, and a human attempting the same task would likely fall back on informed guessing.

Pass/Fail Threshold

Not all documents have a clear subject matter. When this software was being built it became obvious that many documents were not suitable, either because they were simply forms, whose title was as informative about the contents as anything else, or because they were too short to produce enough of a curve in the frequency histogram. A method was devised to automate the culling process and to eliminate files whose subject matter was too subtle to discern. One example of a file that was well written and to the point but too short to benefit from automated tagging was the Corporate Values Statement. The title alone said enough about what the document was about. The content consisted of several value statements, unconnected to each other except by being in the same document, with almost no nouns having a frequency greater than 1.

In practice it was found that one third of the files would be disqualified from processing. The client was a little concerned, but further analysis showed that 97% of the information in the corpus would be retained. The disqualified third did not contain enough variation to distinguish key concepts, and as it turned out the file name was good enough to know exactly what was in each of those files.

The measure used to determine whether a file met the necessary criteria was the coefficient of variation, also known as the relative standard deviation, computed over the file's noun frequencies. It is the standard deviation divided by the mean. The higher the number, the more likely the file had enough differentiation for key concepts to emerge.
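As a rough sketch of this pass/fail check in Python: the cutoff value `min_cv=0.5` is purely hypothetical (the actual threshold used on the project is not stated), and using the population rather than the sample standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts):
    """Standard deviation of the noun frequencies divided by their mean."""
    counts = list(counts)
    if not counts:
        return 0.0  # no nouns at all: nothing to differentiate
    mu = mean(counts)
    if mu == 0:
        return 0.0
    # Assumption: population standard deviation; the source does not say
    # whether the population or sample form was used.
    return pstdev(counts) / mu


def passes_threshold(counts, min_cv=0.5):
    """Return True if the file shows enough variation to be tagged.

    `min_cv` is a hypothetical cutoff chosen for illustration only.
    """
    return coefficient_of_variation(counts) >= min_cv


flat = [1, 1, 1, 1, 1]        # e.g. a short values statement
peaked = [12, 7, 3, 1, 1, 1]  # a document with a few dominant nouns
```

A document where every noun appears once has no variation at all and fails, while a document with a few dominant nouns scores well above the cutoff.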

N-gram Retention

The final item that needs to be addressed about concept extraction is which concepts to keep. With our corpus we treated uni-grams, bi-grams and longer n-grams separately. Uni-grams got the most scrutiny, and higher thresholds were set on keeping them. Once ranked by frequency, only the top 10 were candidates, and only if they occurred more than once.

Bi-grams were given a lower threshold. When a noun group is bigger than one word it tends to be more important. For longer documents, however, cutoffs have to be made.

N-grams of three or more words are always considered important and are kept. Even for very large documents this doesn’t seem to pose a problem.
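The retention rules above might be sketched like this in Python. The top-10 and must-repeat rules for uni-grams come from the description above; the bi-gram cap `max_bigrams` is a hypothetical stand-in for the unspecified cutoff, and the function name is illustrative.

```python
from collections import Counter


def retain_concepts(unigrams, bigrams, ngrams, max_bigrams=20):
    """Pick which extracted noun phrases to keep as tags.

    Each argument is a Counter mapping a phrase to its frequency.
    `max_bigrams` is a hypothetical cap standing in for the cutoff
    applied to longer documents.
    """
    # Uni-grams: only the 10 most frequent, and only if they repeat.
    kept = [term for term, count in unigrams.most_common(10) if count > 1]
    # Bi-grams: lower bar, a single occurrence is enough, up to the cap.
    kept += [term for term, _ in bigrams.most_common(max_bigrams)]
    # Longer n-grams: always kept.
    kept += list(ngrams)
    return kept


tags = retain_concepts(
    Counter({"policy": 4, "team": 2, "form": 1}),
    Counter({"miss rice": 1}),
    Counter({"secretary of state condoleezza rice": 1}),
)
```

Here "form" is dropped because it occurs only once, while the single-occurrence bi-gram and the long n-gram are both retained.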

SharePoint 2010 Incorporation

Using this process with SharePoint 2010 is very straightforward. Event receivers are written for the ItemAdded and ItemUpdated events. The process is simple: use IFilters to extract the text from the file, then hand that text to the tagging process. An asynchronous workflow did the trick. We can’t rely on the Python process for speed, and we know there will be a lot of potential users of the system, so taking the requests, putting them in a queue and releasing the UI back to the user sooner is just plain better design. Both receivers use the .NET class System.Diagnostics.Process (and a lot of other code, of course) to pipe the file’s text to the Python process. The process returns the concept tags, and these are then applied to the item. In the case of an update, read the existing tags first, then compare them to the process output and add any that are new. It is possible that someone added a tag that the process would not find, and you want to preserve it.
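The update rule, keeping every existing tag and adding only the generated ones that are new, can be illustrated with a small Python sketch. The production logic lived in the C# event receivers; this version only shows the comparison, and matching tags case-insensitively is an assumption rather than something stated above.

```python
def merge_tags(existing_tags, generated_tags):
    """Return the existing tags plus any generated tags not already present.

    Existing tags are never removed, so a tag added by a visitor
    (for example "funny") survives a re-run of the tagging process.
    Assumption: tags are compared case-insensitively.
    """
    seen = {tag.casefold() for tag in existing_tags}
    merged = list(existing_tags)
    for tag in generated_tags:
        key = tag.casefold()
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


current = ["funny", "Policy"]          # tags already on the item
from_process = ["policy", "budget"]    # tags returned by the process
updated = merge_tags(current, from_process)
```

Only "budget" is appended in this example; "policy" is already present and the human-applied "funny" is left untouched.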

Conclusion

This was an excellent project. One other programmer and I were involved, and it took about 5 weeks to go from inception to production. I believe that this concept has much broader application beyond aiding folksonomy. These tags, plus metadata, can be stored in database tables and analyzed more fully. The similarity between the concept maps of different files can be calculated, and from there many types of data manipulation become possible. The Python process could also be compiled as an executable and incorporated into a WCF service. This would make it a hosted service and would cut down greatly on process startup times.
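One simple way to calculate the similarity between two files' concept maps is the Jaccard index over their tag sets, sketched below in Python. This is only one possible measure, chosen here for illustration; the project did not specify which similarity function would be used, and the tag sets shown are made up.

```python
def jaccard_similarity(tags_a, tags_b):
    """Share of tags two documents have in common, from 0.0 to 1.0."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # two untagged documents: treat as not comparable
    return len(a & b) / len(a | b)


# Hypothetical tag sets for two documents.
doc_a = {"budget", "policy", "team"}
doc_b = {"policy", "team", "travel"}
score = jaccard_similarity(doc_a, doc_b)
```

Two of the four distinct tags are shared, so these documents score 0.5. Pairwise scores like this could feed the grouping and clustering functions mentioned in the introduction.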