Why cluster?
- As a preprocessing step for another technique.
- To explore and understand the data.
Why not just use scikit-learn?
- When things don't work, it's useful to be able to dig in, knowing how the algorithm works underneath.
- Do use the library, but sometimes it pays to do things by hand.
K-Means
- Lloyd's algorithm always converges to a solution, but possibly a local minimum; the objective is non-convex, so the global minimum is not guaranteed.
- Run it multiple times with randomised starting centroids and pick the best result (lowest inertia), as in the sketch below.
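A minimal sketch of the restart strategy with scikit-learn's KMeans; the toy data and parameter values here are my own illustration, not from the talk:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Toy data: three well-separated 2-D blobs.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (5, 0), (0, 5)]])

# init="random", n_init=10: run Lloyd's algorithm from 10 random
# starting centroids and keep the run with the lowest inertia.
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
print(km.inertia_)  # sum of squared distances to the nearest centre
```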
When K-Means breaks down (and when it doesn't)
- If features (dimensions) have different scales.
- K-Means assumes the data are well described by centroids and that all features are on the same scale, so the features must be scaled first (see the sketch after this list).
- !!AI so what kind of feature scaling to use?
- K-Means is fairly resistant to irrelevant dimensions (ones carrying no information),
- but it fails if an irrelevant dimension has large variance (!!AI large variance, a different scale, or both?).
- It also breaks down if a dimension isn't Gaussian-like but uniform random noise.
- Lesson: be careful when adding new features/dimensions.
- "Curse of dimensionality": with too many dimensions the very notion of similarity breaks down, and Euclidean distance becomes less useful.
- If the clusters are non-isotropic (not linearly separable, no clear centroid), K-Means will fail too; spectral clustering, below, handles this case.
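The scaling sketch promised above: standardising each feature to zero mean and unit variance before K-Means. The toy data are my own illustration; the second feature is irrelevant noise on a much larger scale, which also demonstrates the high-variance failure mode:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Feature 0 separates two groups; feature 1 is irrelevant noise on a
# ~100x larger scale and dominates unscaled Euclidean distance.
X = np.vstack([
    np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 100, 200)]),
    np.column_stack([rng.normal(5, 1, 200), rng.normal(0, 100, 200)]),
])

# Standardise every feature (zero mean, unit variance) before clustering.
model = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10))
labels = model.fit_predict(X)
```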
Choosing between results
- Given two sets of clustering results, how do we quantify the error?
- Inertia: the sum of squared distances between each point and its nearest cluster centre.
- On its own this isn't useful for model selection: adding more clusters always decreases inertia, and it doesn't penalise the increased complexity of the solution (no regularisation).
- Use it only to compare two clustering results under an a priori assumption about the number of clusters.
- Silhouette coefficient: measures how compact a point's own cluster is relative to its distance from points outside that cluster.
- Use it to choose the optimal number of clusters (see the sweep below).
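A sketch of the sweep mentioned above, using scikit-learn's silhouette_score on synthetic blobs; the data and the range of k are my own illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    # Inertia decreases monotonically as k grows; the silhouette
    # coefficient should peak near the true number of clusters (4).
    print(f"k={k}  inertia={km.inertia_:8.1f}  silhouette={sil:.3f}")
```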
Visualising errors
!!AI see presenter notes.
Spectral clustering
- Can capture clusters that are not linearly separable (non-isotropic shapes); see the half-moons sketch below.
- Can estimate the number of clusters.
- Not restricted to Euclidean distance as the metric: you pass in a "similarity matrix" (affinity matrix) instead.
- Good for text.
!!AI see notes. Requires solid linear algebra knowledge (a good reference: Ulrike von Luxburg, "A Tutorial on Spectral Clustering", 2007).
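A minimal sketch of the non-linearly-separable case, using scikit-learn's two-half-moons toy dataset; the parameter values are illustrative:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: no meaningful centroids,
# not linearly separable.
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# The affinity graph connects each point to its nearest neighbours,
# so the clusters no longer have to be centroid-shaped.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        random_state=0).fit_predict(X)

print("K-Means ARI: ", adjusted_rand_score(y, km))  # well below 1
print("Spectral ARI:", adjusted_rand_score(y, sc))  # close to 1.0
```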
- Looks at the smallest eigenvectors (of the graph Laplacian), whereas PCA looks at the largest eigenvectors; they solve different problems. The sketch below makes this concrete.
- Unnormalised spectral clustering tries to balance the size (number of points) of the clusters.
- Normalised spectral clustering tries to balance the degree (density) of the clusters.
- Related to ratio-cut graph partitioning algorithms.
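To make the "smallest eigenvectors" point concrete, here is the standard textbook recipe for unnormalised spectral clustering (graph Laplacian L = D - W), written from scratch. This is my own sketch, not code from the talk, and the Gaussian similarity with a hand-picked sigma is an assumption:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters, sigma=1.0):
    # Dense Gaussian similarity matrix W (O(n^2) time and memory).
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # unnormalised graph Laplacian
    # Keep the eigenvectors for the *smallest* eigenvalues of L
    # (PCA keeps the largest eigenvectors of the covariance matrix:
    # a different problem). For the normalised variant, use
    # L_sym = I - D^{-1/2} W D^{-1/2} and row-normalise U.
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues ascending
    U = eigvecs[:, :n_clusters]
    # Final step: ordinary K-Means in the spectral embedding.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```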
Questions
- What is the time complexity?
- Not sure analytically, but a recommendation problem with 400,000 data points (users and movies) took about 30 minutes on a laptop.
- You first pay O(n^2) to calculate all pairwise distances, then sparsify with a threshold. Couldn't you sparsify first and only then calculate distances? (See the sketch below.)
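One possible answer to that last question, as a sketch: scikit-learn's kneighbors_graph builds the sparse neighbourhood graph with a tree index, so the dense n-by-n distance matrix is never materialised. SpectralClustering with affinity="nearest_neighbors" does the same thing internally; the dataset and n_neighbors value here are my own illustration:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=2000, noise=0.05, random_state=0)

# Sparse affinity: each point is connected only to its 10 nearest
# neighbours, found via a tree index rather than all-pairs distances.
A = kneighbors_graph(X, n_neighbors=10, include_self=False)
A = 0.5 * (A + A.T)  # symmetrise; k-NN relations are not symmetric

print(A.shape, A.nnz)  # 2000 x 2000 logically, but only ~10*n entries stored

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
```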