gramaton cleric wrote:
Data mining class?
Close... Dynamics and structure of Networked Data. I came across a statement that there was an easy solution while reviewing some papers related to my own research. Trust me... I wish that I had simple homework problems at this point. =)
gramaton cleric wrote:
I've done a fair amount of data mining research and implementation. Theres a multitude of ways to approach this as your probably know. For most of these methods you'll have to have some kind of extra input parameter corresponding to quality, closeness, distance, etc. some boundary depending on approach.
That is one of the points that I never got around to mentioning. In higher dimensional spaces, there never seems to be a single distance measure that works for all types of clusters. You can almost always come up with a trivial counter-example. I've been wondering if this isn't the case with a single dimension though.
gramaton cleric wrote:
What I would consider is another partitional clustering algorithm. A less distance restricted way is the "fuzzy" way which is fuzzy because of fuzzy logic. Roughly speaking, you either chose a number of clusters or percent or you can even make it dynamic, then you build up cluster sets based on some metric(an easy one is 1/d where d is the distance) and once you're happy with the set you've built, remove the points and try again and eventually end up with the best sets being the most clustered.
Again, the 1d case is supposed to have a simple solution. I don't think that it will require
a priori knowledge of the number of clusters. Currently, my personal preference is the modularity based methods (they are almost certainly much faster and scalable).
gramaton cleric wrote:
Another way is similar to an MST approach with the Prim algorithm. You build up cluster sets by combining the two closest points. Again, you have to have some kind of extra parameter determining amount per set, number of sets, distance or something else.
I've heard people mention using an MST for this type of thing in the past. I just have this feeling that it fails with certain configurations.