Latent Semantic Analysis of Book Titles
Latent Semantic Analysis
Synonyms: multiple words with the same meaning (e.g. “computer”, “PC”, “laptop”)
Polysemes: one word with multiple meanings:
“Man” (human as opposed to animal, vs. male as opposed to female, vs. the casual “hey, man”)
“Milk” (noun vs. verb, as in “to milk a cow”)
Latent variables:
combine words with similar meaning into one hidden variable, e.g.:
z = 0.7*computer + 0.5*PC + 0.6*laptop
(a hidden variable that represents all of them)
The job of latent semantic analysis (LSA) is to find these latent variables and transform the original data into them. Ideally the transformed data has much lower dimensionality than the original, which speeds up computation.
Does this help with polysemy?
There are conflicting viewpoints on whether LSA helps with polysemy.
LSA is really Singular Value Decomposition (SVD) on the term-document matrix.
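A minimal sketch of LSA as truncated SVD on a small set of book titles, assuming scikit-learn is available; the titles and the choice of 2 components are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Made-up book titles standing in for the real dataset.
titles = [
    "Introduction to Computer Science",
    "Cheap Laptop and PC Repair",
    "The PC and Laptop Buyer Guide",
    "Cooking with Milk and Cheese",
]

# Term-document matrix: one row per title (sample), one column per term (input).
X = CountVectorizer().fit_transform(titles)

# LSA = truncated SVD on the term-document matrix.
# n_components = number of latent variables z to keep.
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)          # shape: (4 titles, 2 latent dims)

# Each latent variable is a weighted combination of the original terms,
# like z = 0.7*computer + 0.5*PC + 0.6*laptop above.
print(lsa.components_.shape)      # (2, vocab_size)
```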
PCA is a simpler form of SVD.
Principal Components Analysis (PCA)
PCA rotates our original input vectors: same vectors, different coordinate system.
PCA does 3 things for us:
1) Decorrelates the input data: in the new coordinate system, the data has zero correlation between dimensions
2) Orders the transformed data by information content: dimensions come out in decreasing order of variance, so later dimensions carry less information
3) Enables dimensionality reduction: e.g. 1000 words might reduce to ~100 distinct latent terms
removing information != decreasing predictive ability (the low-variance dimensions we drop carry little information)
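A small numpy/scikit-learn sketch of all three properties on made-up correlated data (the data and shapes are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two correlated features: x2 is mostly a copy of x1.
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)
X = np.column_stack([x1, x2])

Z = PCA().fit_transform(X)        # rotate into the new coordinate system

# 1) Decorrelated: off-diagonal correlations are ~0.
print(np.corrcoef(Z, rowvar=False).round(4))

# 2) Ordered by information: variance decreases across columns.
print(Z.var(axis=0))

# 3) Dimensionality reduction: keep only the first (highest-variance) column.
Z_reduced = Z[:, :1]
```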
Covariance
more variance is synonymous with more information, which is why PCA orders dimensions by variance
non-matrix form: cov(i, j) = (1/N) * sum_n (x[n,i] - mu_i) * (x[n,j] - mu_j)
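A sketch checking that non-matrix formula against its matrix form in numpy (the variable names and random data are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))          # N=200 samples, D=3 features
mu = X.mean(axis=0)
N, D = X.shape

# Non-matrix form: one sum per feature pair (i, j).
C = np.zeros((D, D))
for i in range(D):
    for j in range(D):
        C[i, j] = ((X[:, i] - mu[i]) * (X[:, j] - mu[j])).sum() / N

# Matrix form agrees: C = (1/N) (X - mu)^T (X - mu)
assert np.allclose(C, (X - mu).T @ (X - mu) / N)
```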
Eigenvalues & Eigenvectors
PCA finds the eigenvectors of the covariance matrix; each eigenvalue is the variance of the data along its eigenvector, which gives the information ordering above.
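A quick numpy check of the defining relationship C v = lambda v on a made-up covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
C = np.cov(X, rowvar=False)            # symmetric covariance matrix

# eigh is for symmetric matrices; returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)

# Each column v satisfies C @ v = lam * v.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(C @ v, lam * v)

# Sort descending so the largest-variance direction comes first (PCA's ordering).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
```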
Extending PCA: PCA helps us combine input features (words/terms, the columns of the input matrix).
“Term-document matrix”: each term is an input (feature), each document is a sample.
What if we want to combine and decorrelate by document instead? Just do PCA on the transpose... which leads to weird results (see the sketch below).
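A sketch of the two directions, assuming X is (documents x terms) as described above; the shapes and random counts are made up:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.poisson(1.0, size=(50, 300)).astype(float)   # 50 documents x 300 terms

# PCA on X: combines/decorrelates the 300 term features.
Z_terms = PCA(n_components=10).fit_transform(X)      # (50 docs, 10 latent term-combos)

# PCA on X.T: combines/decorrelates the 50 documents instead.
Z_docs = PCA(n_components=10).fit_transform(X.T)     # (300 terms, 10 latent doc-combos)
```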
SVD (singular value decomposition)
SVD does both of these at the same time: it is effectively PCA on the matrix and on its transpose simultaneously (lots of fun math).
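A numpy sketch of the decomposition X = U S V^T, where the U side covers documents and the V side covers terms, handling both jobs at once (data made up as before):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.poisson(1.0, size=(50, 300)).astype(float)   # documents x terms

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(X, U @ np.diag(s) @ Vt)

k = 10                                 # latent dimensions to keep
docs_latent = U[:, :k] * s[:k]         # documents in latent space (like PCA on X)
terms_latent = Vt[:k, :].T * s[:k]     # terms in latent space (like PCA on X.T)
```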