Project author: thetobysiu

Project description:
Using K-means clustering to categorize subjects
Language: Jupyter Notebook
Repository: git://github.com/thetobysiu/K-means-clustering-research.git
Created: 2019-09-03T15:47:14Z
Project page: https://github.com/thetobysiu/K-means-clustering-research



    # Import modules
    import numpy as np
    import pandas as pd
    import nltk
    # Set seed for reproducibility
    np.random.seed(5)
    xls = pd.ExcelFile('Glossary.xls')
    xls.sheet_names
  1. ['Glossary', 'alternative_tag', 'subject_type']
    glossary_df = xls.parse('Glossary')
    print(glossary_df.head())
  1. language subject_id keyword \
  2. 0 chi 0 國際收支平衡表
  3. 1 eng 0 BoP account
  4. 2 chi 0 甲類、乙類、丙類及綜合消費物價指數
  5. 3 eng 0 CPI(A), CPI(B), CPI(C) and Composite CPI
  6. 4 chi 0 死亡
  7. simplified_definition detail_definition
  8. 0 NaN 國際收支平衡表是有系統地載錄,在指定期間內,某經濟體系與世界各地的各類經濟交易的統計表。
  9. 1 NaN A BoP account is a statistical statement that ...
  10. 2 NaN 綜合消費物價指數反映消費物價轉變對整體住戶的影響,而甲類、乙類及丙類消費物價指數則分別以開支...
  11. 3 NaN The Composite CPI reflects the impact of consu...
  12. 4 NaN 死亡指某人在活產後的任何期間,永久失去所有生命徵象。
    subject_df = xls.parse('subject_type')
    print(subject_df.head())
  1. id subject_id chi_name \
  2. 0 1 10 官方統計綜合報告及參考資料
  3. 1 2 20 人口
  4. 2 3 30 勞工
  5. 3 4 40 對外貿易
  6. 4 5 50 國民收入及國際收支平衡
  7. eng_name
  8. 0 General Reports and References in Official Sta...
  9. 1 Population
  10. 2 Labour
  11. 3 External Trade
  12. 4 National Income and Balance of Payments
    subject_df = subject_df.set_index('subject_id')
    print(subject_df.head())
  1. id chi_name \
  2. subject_id
  3. 10 1 官方統計綜合報告及參考資料
  4. 20 2 人口
  5. 30 3 勞工
  6. 40 4 對外貿易
  7. 50 5 國民收入及國際收支平衡
  8. eng_name
  9. subject_id
  10. 10 General Reports and References in Official Sta...
  11. 20 Population
  12. 30 Labour
  13. 40 External Trade
  14. 50 National Income and Balance of Payments
    theme_df = glossary_df[['subject_id', 'language', 'keyword', 'detail_definition']]
    print(theme_df.head())
  1. subject_id language keyword \
  2. 0 0 chi 國際收支平衡表
  3. 1 0 eng BoP account
  4. 2 0 chi 甲類、乙類、丙類及綜合消費物價指數
  5. 3 0 eng CPI(A), CPI(B), CPI(C) and Composite CPI
  6. 4 0 chi 死亡
  7. detail_definition
  8. 0 國際收支平衡表是有系統地載錄,在指定期間內,某經濟體系與世界各地的各類經濟交易的統計表。
  9. 1 A BoP account is a statistical statement that ...
  10. 2 綜合消費物價指數反映消費物價轉變對整體住戶的影響,而甲類、乙類及丙類消費物價指數則分別以開支...
  11. 3 The Composite CPI reflects the impact of consu...
  12. 4 死亡指某人在活產後的任何期間,永久失去所有生命徵象。
    theme_df_pivot = theme_df.pivot(columns='language', values=['subject_id', 'keyword', 'detail_definition'])
    # Flatten the MultiIndex columns, e.g. ('keyword', 'eng') becomes 'keyword_eng'
    theme_df_pivot.columns = ['_'.join(col).strip() for col in theme_df_pivot.columns.values]
    print(theme_df_pivot.head())
  1. subject_id_chi subject_id_eng keyword_chi \
  2. 0 0 NaN 國際收支平衡表
  3. 1 NaN 0 NaN
  4. 2 0 NaN 甲類、乙類、丙類及綜合消費物價指數
  5. 3 NaN 0 NaN
  6. 4 0 NaN 死亡
  7. keyword_eng \
  8. 0 NaN
  9. 1 BoP account
  10. 2 NaN
  11. 3 CPI(A), CPI(B), CPI(C) and Composite CPI
  12. 4 NaN
  13. detail_definition_chi \
  14. 0 國際收支平衡表是有系統地載錄,在指定期間內,某經濟體系與世界各地的各類經濟交易的統計表。
  15. 1 NaN
  16. 2 綜合消費物價指數反映消費物價轉變對整體住戶的影響,而甲類、乙類及丙類消費物價指數則分別以開支...
  17. 3 NaN
  18. 4 死亡指某人在活產後的任何期間,永久失去所有生命徵象。
  19. detail_definition_eng
  20. 0 NaN
  21. 1 A BoP account is a statistical statement that ...
  22. 2 NaN
  23. 3 The Composite CPI reflects the impact of consu...
  24. 4 NaN
    eng_df = theme_df_pivot[['subject_id_eng', 'keyword_eng', 'detail_definition_eng']]
    eng_df.head()

      subject_id_eng  keyword_eng                               detail_definition_eng
    0 NaN             NaN                                       NaN
    1 0               BoP account                               A BoP account is a statistical statement that ...
    2 NaN             NaN                                       NaN
    3 0               CPI(A), CPI(B), CPI(C) and Composite CPI  The Composite CPI reflects the impact of consu...
    4 NaN             NaN                                       NaN

    eng_df = eng_df.dropna()
    print(eng_df.head())
    print(eng_df.describe())
  1. subject_id_eng keyword_eng \
  2. 1 0 BoP account
  3. 3 0 CPI(A), CPI(B), CPI(C) and Composite CPI
  4. 5 0 Death
  5. 7 0 Domestic household
  6. 9 0 Employed persons
  7. detail_definition_eng
  8. 1 A BoP account is a statistical statement that ...
  9. 3 The Composite CPI reflects the impact of consu...
  10. 5 A death refers to the permanent disappearance ...
  11. 7 Consist of a group of persons who live togethe...
  12. 9 Refer to those persons aged >=15 who have been...
  13. subject_id_eng keyword_eng \
  14. count 728 728
  15. unique 34 345
  16. top 170 Establishment
  17. freq 65 9
  18. detail_definition_eng
  19. count 728
  20. unique 414
  21. top Population by-census is a statistical project ...
  22. freq 6
    # eng_df['concat'] = eng_df['keyword_eng'] + ' ' + eng_df['detail_definition_eng']
    # eng_df = eng_df[['subject_id_eng', 'concat']]
    # print(eng_df.head())
    eng_dfg = eng_df.groupby('subject_id_eng')
    print(eng_dfg.get_group(0))
    print(eng_dfg.get_group(0).info())
  1. subject_id_eng keyword_eng \
  2. 1 0 BoP account
  3. 3 0 CPI(A), CPI(B), CPI(C) and Composite CPI
  4. 5 0 Death
  5. 7 0 Domestic household
  6. 9 0 Employed persons
  7. 11 0 Exports of services
  8. 13 0 Gross Domestic Product (GDP)
  9. 15 0 Imports of goods
  10. 17 0 Imports of services
  11. 19 0 Insurance services
  12. 21 0 Labour force
  13. 23 0 Nominal Wage Index
  14. 25 0 Total exports of goods
  15. 27 0 Total merchandise trade
  16. 29 0 Underemployment rate
  17. 31 0 Unemployment rate
  18. 32 0 yoy%
  19. detail_definition_eng
  20. 1 A BoP account is a statistical statement that ...
  21. 3 The Composite CPI reflects the impact of consu...
  22. 5 A death refers to the permanent disappearance ...
  23. 7 Consist of a group of persons who live togethe...
  24. 9 Refer to those persons aged >=15 who have been...
  25. 11 Exports of services are the sales of services ...
  26. 13 GDP is a measure of the total value of product...
  27. 15 Imports of goods refer to goods which have bee...
  28. 17 Imports of services are the purchases of servi...
  29. 19 Insurance services cover all types of direct i...
  30. 21 Labour force refers to the land-based non-inst...
  31. 23 Nominal Wage Index measures the pure changes i...
  32. 25 Total exports of goods comprise domestic expor...
  33. 27 Total merchandise trade refers to all the move...
  34. 29 Underemployment rate refers to the proportion ...
  35. 31 Unemployment rate refers to the proportion of ...
  36. 32 Year-on-year percentage change
  37. <class 'pandas.core.frame.DataFrame'>
  38. Int64Index: 17 entries, 1 to 32
  39. Data columns (total 3 columns):
  40. subject_id_eng 17 non-null object
  41. keyword_eng 17 non-null object
  42. detail_definition_eng 17 non-null object
  43. dtypes: object(3)
  44. memory usage: 544.0+ bytes
  45. None
    # subject_id 0 is not a theme but a predefined category
    # ('For Browse by Subject Page Only'), so exclude it
    eng_clean_df = eng_df[eng_df['subject_id_eng'] != 0]
    print(eng_clean_df)
  1. subject_id_eng keyword_eng \
  2. 34 20 "De facto" concept
  3. 36 20 "De jure" concept
  4. 38 20 Child dependency ratio
  5. 40 20 Crude birth rate
  6. 42 20 Crude death rate
  7. 44 20 Crude marriage rate
  8. 46 20 Death
  9. 48 20 Domestic household
  10. 50 20 Elderly dependency ratio
  11. 52 20 Expectation of life at birth
  12. 54 20 Extended "de facto" approach
  13. 56 20 Live births
  14. 58 20 Median age
  15. 60 20 Median age at first marriage
  16. 62 20 Median Mortgage Payment and Loan Repayment to ...
  17. 64 20 Median Rent to Income Ratio
  18. 66 20 Mobile Residents
  19. 68 20 Mortgage Payment and Loan Repayment to Income ...
  20. 70 20 Natural Increase
  21. 72 20 Net movement
  22. 74 20 Overall dependency ratio
  23. 76 20 Population by-census
  24. 78 20 Population census
  25. 80 20 Population growth rate
  26. 82 20 Registered marriage
  27. 84 20 Rent to Income Ratio
  28. 86 20 Resident population
  29. 88 20 Sex ratio
  30. 90 20 Total fertility rate
  31. 92 20 Usual Residents
  32. ... ... ...
  33. 1378 461 Demographic dependency ratio
  34. 1379 461 Economic dependency ratio
  35. 1380 461 Economic activity status
  36. 1381 461 Monthly household income
  37. 1382 461 Pre-intervention
  38. 1383 461 Post-intervention (recurrent cash)
  39. 1384 461 Post-intervention (recurrent + non-recurrent c...
  40. 1385 461 Post-intervention (recurrent cash + in-kind)
  41. 1386 461 Policy intervention measures
  42. 1387 461 Taxation
  43. 1388 461 Recurrent cash benefits
  44. 1389 461 Non-recurrent cash benefits
  45. 1390 461 In-kind benefits
  46. 1391 461 Persons
  47. 1392 461 Economically active persons
  48. 1393 461 Economically inactive persons
  49. 1394 461 Employed persons
  50. 1395 461 Full-time workers
  51. 1396 461 Part-time workers
  52. 1397 461 Underemployed persons
  53. 1398 461 Unemployed persons
  54. 1399 461 Household head
  55. 1400 461 Unemployment rate
  56. 1401 461 Median
  57. 1402 461 Percentiles
  58. 1403 461 Poverty indicators
  59. 1404 461 Poverty incidence
  60. 1405 461 Poverty rate
  61. 1406 461 Poverty gap
  62. 1407 461 Poverty line
  63. detail_definition_eng
  64. 34 Under a "de facto" concept, the population inc...
  65. 36 Under a "de jure" concept, all persons who usu...
  66. 38 Child dependency ratio refers to the number of...
  67. 40 Crude birth rate refers to the number of live ...
  68. 42 Crude death rate refers to the number of death...
  69. 44 Crude marriage rate refers to the number of ma...
  70. 46 A death refers to the permanent disappearance ...
  71. 48 A domestic household consists of a group of pe...
  72. 50 Elderly dependency ratio refers to the number ...
  73. 52 Expectation of life at birth refers to the num...
  74. 54 Under the "extended de facto" approach, the Ho...
  75. 56 A live birth refers to the complete expulsion ...
  76. 58 Median age is an indicator of the average age ...
  77. 60 Median age at first marriage is an indicator o...
  78. 62 It refers to the average percentage of monthly...
  79. 64 It refers to the average percentage of monthly...
  80. 66 The "Hong Kong Resident Population" comprises ...
  81. 68 It refers to the percentage of monthly househo...
  82. 70 Natural increase refers to the balance of live...
  83. 72 It is the number of inflow to a country/territ...
  84. 74 Overall dependency ratio refers to the number ...
  85. 76 Population by-census is a statistical project ...
  86. 78 Population census refers to a large-scale stat...
  87. 80 Population growth rate refers to the populatio...
  88. 82 A registered marriage is defined as a voluntar...
  89. 84 It refers to the percentage of monthly househo...
  90. 86 The practical definition of "resident populati...
  91. 88 Sex ratio refers to the ratio of the number of...
  92. 90 Total fertility rate refers to the average num...
  93. 92 The "Hong Kong Resident Population" comprises ...
  94. ... ...
  95. 1378 Refers to the number of persons aged below 18 ...
  96. 1379 Refers to the number of economically inactive ...
  97. 1380 Households / population can be classified into...
  98. 1381 The total income earned by all member(s) of th...
  99. 1382 This income type only includes household membe...
  100. 1383 Refers to the household income after tax, incl...
  101. 1384 Refers to the household income after tax, incl...
  102. 1385 Refers to the household income after tax, incl...
  103. 1386 According to the discussion of Commission on P...
  104. 1387 Includes salaries tax and property tax, as wel...
  105. 1388 Refer to cash-based benefits / cash-equivalent...
  106. 1389 Refer to non-recurrent cash benefits provided ...
  107. 1390 Refer to in-kind benefits provided with means ...
  108. 1391 Refer to those persons residing in domestic ho...
  109. 1392 Synonymous with the labour force, comprise the...
  110. 1393 Include all persons who have not had a job and...
  111. 1394 For a person aged 15 or over to be classified ...
  112. 1395 Refer to employed persons who work 35 hours an...
  113. 1396 Refer to employed persons who work less than 3...
  114. 1397 The criteria for an employed person to be clas...
  115. 1398 For a person aged 15 or over to be classified ...
  116. 1399 A household head is acknowledged by other fami...
  117. 1400 Refers to the proportion of unemployed persons...
  118. 1401 For an ordered data set which is arranged in a...
  119. 1402 Percentiles are the 99 values that divide an o...
  120. 1403 Quantitative measurements of poverty.
  121. 1404 Refers to the number of poor households and th...
  122. 1405 The ratio of the poor population to the total ...
  123. 1406 Poverty gap of a poor household refers to the ...
  124. 1407 A threshold to define poor households and thei...
  125. [711 rows x 3 columns]
    # Join all definitions of each subject into one document per subject
    merged_series = eng_clean_df.groupby('subject_id_eng')['detail_definition_eng'].apply(' '.join)
    print(merged_series)
  1. subject_id_eng
  2. 20 Under a "de facto" concept, the population inc...
  3. 30 Casual employee refers to an employee who is e...
  4. 40 Charges for the use of intellectual property r...
  5. 50 Balance of Payments (BoP) is a statistical sta...
  6. 60 The Consumer Price Index (CPI) measures the ch...
  7. 70 (i) For the Industry Section of Transportation...
  8. 120 R&D capital expenditure includes actual ex...
  9. 130 Total cargo discharged includes imports and in...
  10. 150 Under a "de facto" concept, the population inc...
  11. 160 Under a "de facto" concept, the population inc...
  12. 170 If a person is able to conduct a short convers...
  13. 180 Under a "de facto" concept, the population inc...
  14. 190 Under a "de facto" concept, the population inc...
  15. 200 Casual employee refers to an employee who is e...
  16. 210 Hourly wage is derived by dividing (i) the amo...
  17. 230 Exports of goods to the mainland of China for ...
  18. 240 Charges for the use of intellectual property r...
  19. 250 Changes in inventories refer to the value of p...
  20. 260 Balance of Payments (BoP) is a statistical sta...
  21. 270 The Consumer Price Index (CPI) measures the ch...
  22. 280 Movement in producer price index for the indus...
  23. 290 Based on the household expenditure patterns ob...
  24. 300 (i) For the Industry Section of Transportation...
  25. 310 Compensation of employees = Wages and salaries...
  26. 320 For (i) the Industry Sections of import/export...
  27. 330 Compensation of employees\n= Wages and salarie...
  28. 340 (i) For the Industry Section of Transportation...
  29. 350 (i) For information and communications industr...
  30. 360 A local office is an office with parent compan...
  31. 452 Civil servants refer to persons who are employ...
  32. 454 Rate of gross margin refers to the gross margi...
  33. 459 If a person aged 5 and over (excluding mute pe...
  34. 461 Refer to a group of persons who live together ...
  35. Name: detail_definition_eng, dtype: object
    # Build a lookup from subject_id to the English subject name
    subject_id_dict = subject_df['eng_name'].to_dict()
    print(subject_id_dict)
  1. {10: 'General Reports and References in Official Statistics', 20: 'Population', 30: 'Labour', 40: 'External Trade', 50: 'National Income and Balance of Payments', 60: 'Prices', 70: 'Business Performance', 80: 'The Four Key Industries and the Six Industries', 90: 'Energy', 100: 'Housing and Property', 110: 'Public Accounts, Money and Finance', 120: 'Science and Technology', 130: 'Transport, Communications and Tourism', 140: 'Miscellaneous Statistics', 150: 'Population Estimates', 160: 'Demographics', 170: '2011 Population Census', 180: 'Gender', 190: 'Population Projections', 200: 'Labour Force', 452: 'Employment and Vacancies', 210: 'Wages and Labour Earnings', 220: 'Manpower', 230: 'Merchandise Trade', 240: 'Trade in Services', 250: 'National Income', 260: 'Balance of Payments', 270: 'Consumer Prices', 280: 'Producer Prices', 290: 'Household Expenditures', 300: 'Business Performance and Prospects', 310: 'Industrial Production', 320: 'Import/export, wholesale and retail trades, and accommodation and food services sectors', 330: 'Building, Construction and Real Estate Sectors', 340: 'Transportation, Storage and Courier Services Sector', 350: 'Information and communications, financing and Insurance, Professional and Business Services Sectors', 360: 'Companies in Hong Kong with Parent Companies Located outside HK', 370: 'Education', 380: 'Health', 390: 'Social Welfare', 400: 'Law and Order', 410: 'Culture, Entertainment and Recreation', 420: 'Environment, Climate and Geography', 430: 'Others', 440: '2006 By-census', 450: '2001 Population census', 454: 'Offshore Trade in Goods', 459: '2016 By-census', 461: 'Poverty Situation', 0: 'For Browse by Subject Page Only'}
    # Import the SnowballStemmer to perform stemming
    from nltk.stem.snowball import SnowballStemmer
    from nltk.corpus import stopwords
    import re
    stemmer = SnowballStemmer("english")
    # Define a function to perform both tokenization and stemming
    def tokenize_and_stem(text):
        # Tokenize the text into words
        tokens = nltk.word_tokenize(text)
        # Load the stop-word list once per call rather than once per token
        stop_words = set(stopwords.words('english'))
        # Drop stop words (case-sensitive match) and tokens that contain no letters
        filtered_tokens = [token for token in tokens
                           if token not in stop_words and re.search('[a-zA-Z]', token)]
        # Stem the filtered tokens
        stems = [stemmer.stem(word) for word in filtered_tokens]
        return stems
    import nltk
    nltk.download('stopwords')
  1. [nltk_data] Downloading package stopwords to
  2. [nltk_data] C:\Users\Toby\AppData\Roaming\nltk_data...
  3. [nltk_data] Package stopwords is already up-to-date!
  4. True
    words_stemmed = tokenize_and_stem(merged_series[20])
    print(words_stemmed)
  1. ['under', 'de', 'facto', 'concept', 'popul', 'includ', 'person', 'country/territori', 'refer', 'time-point', 'that', 'method', 'equival', 'take', 'snapshot', 'popul', 'refer', 'time-point', 'under', 'de', 'jure', 'concept', 'person', 'usual', 'live', 'country/territori', 'particular', 'refer', 'time-point', 'usual', 'taken', 'middl', 'year', 'count', 'popul', 'country/territori', 'child', 'depend', 'ratio', 'refer', 'number', 'person', 'age', 'per', 'person', 'age', 'crude', 'birth', 'rate', 'refer', 'number', 'live', 'birth', 'given', 'year', 'per', 'mid-year', 'popul', 'year', 'crude', 'death', 'rate', 'refer', 'number', 'death', 'given', 'year', 'per', 'mid-year', 'popul', 'year', 'crude', 'marriag', 'rate', 'refer', 'number', 'marriag', 'regist', 'given', 'year', 'per', 'mid-year', 'popul', 'year', 'a', 'death', 'refer', 'perman', 'disappear', 'evid', 'life', 'live', 'birth', 'taken', 'place', 'a', 'domest', 'household', 'consist', 'group', 'person', 'live', 'togeth', 'make', 'common', 'provis', 'essenti', 'live', 'these', 'person', 'need', 'relat', 'if', 'person', 'make', 'provis', 'essenti', 'live', 'without', 'share', 'person', 'he/sh', 'also', 'regard', 'household', 'in', 'case', 'household', 'one-person', 'household', 'in', 'figur', 'compil', 'general', 'household', 'survey', 'refer', 'period', 'start', 'q1', 'popul', 'by-census', 'household', 'compris', 'mobil', 'resid', 'includ', 'domest', 'household', 'elder', 'depend', 'ratio', 'refer', 'number', 'person', 'age', 'per', 'person', 'age', 'expect', 'life', 'birth', 'refer', 'number', 'year', 'life', 'person', 'born', 'given', 'year', 'expect', 'live', 'he/sh', 'subject', 'preval', 'mortal', 'condit', 'reflect', 'set', 'age-sex', 'specif', 'mortal', 'rate', 'year', 'under', 'extend', 'de', 'facto', 'approach', 'hong', 'kong', 'popul', 'cover', 'person', 'hong', 'kong', 'refer', 'time', 'point', 'includ', 'hong', 'kong', 'perman', 'resid', 'hong', 'kong', 'non-perman', 'resid', 'well', 'visitor', 
'extend', 'relat', 'fact', 'hong', 'kong', 'perman', 'resid', 'he/sh', 'still', 'count', 'part', 'hong', 'kong', 'popul', 'refer', 'time-point', 'he/sh', 'hong', 'kong', 'temporarili', 'mainland', 'china', 'macao', 'a', 'live', 'birth', 'refer', 'complet', 'expuls', 'extract', 'mother', 'product', 'concept', 'separ', 'breath', 'show', 'evid', 'life', 'median', 'age', 'indic', 'averag', 'age', 'total', 'number', 'person', 'age', 'median', 'age', 'first', 'marriag', 'indic', 'averag', 'age', 'person', 'first', 'marriag', 'person', 'age', 'it', 'refer', 'averag', 'percentag', 'month', 'household', 'incom', 'paid', 'month', 'mortgag', 'payment', 'loan', 'repay', 'calcul', 'domest', 'household', 'own', 'quarter', 'occupi', 'mortgag', 'loan', 'paid', 'percentag', 'paid', 'less', 'household', 'zero', 'incom', 'and/or', 'zero', 'mortgag', 'payment', 'loan', 'repay', 'household', 'member', 'i.e', 'mortgag', 'payment', 'loan', 'repay', 'non-household', 'member', 'exclud', 'calcul', 'it', 'refer', 'averag', 'percentag', 'month', 'household', 'incom', 'paid', 'month', 'household', 'rent', 'calcul', 'domest', 'household', 'rent', 'accommod', 'occupi', 'paid', 'percentag', 'paid', 'less', 'household', 'zero', 'incom', 'and/or', 'zero', 'rent', 'exclud', 'calcul', 'the', 'hong', 'kong', 'resid', 'popul', 'compris', 'usual', 'resid', 'mobil', 'resid', 'usual', 'resid', 'refer', 'two', 'categori', 'peopl', 'hong', 'kong', 'perman', 'resid', 'stay', 'hong', 'kong', 'least', 'month', 'month', 'least', 'month', 'month', 'refer', 'time-point', 'regardless', 'whether', 'hong', 'kong', 'refer', 'time-point', 'b', 'hong', 'kong', 'non-perman', 'resid', 'hong', 'kong', 'refer', 'time-point', 'for', 'hong', 'kong', 'perman', 'resid', 'usual', 'resid', 'classifi', 'mobil', 'resid', 'stay', 'hong', 'kong', 'least', 'month', 'less', 'month', 'month', 'least', 'month', 'less', 'month', 'month', 'refer', 'time-point', 'regardless', 'whether', 'hong', 'kong', 'refer', 'time-point', 'under', 
'resid', 'popul', 'approach', 'visitor', 'includ', 'hong', 'kong', 'popul', 'it', 'refer', 'percentag', 'month', 'household', 'incom', 'paid', 'month', 'mortgag', 'payment', 'loan', 'repay', 'domest', 'household', 'own', 'quarter', 'occupi', 'mortgag', 'loan', 'household', 'zero', 'incom', 'and/or', 'zero', 'mortgag', 'payment', 'loan', 'repay', 'household', 'member', 'i.e', 'mortgag', 'payment', 'loan', 'repay', 'non-household', 'member', 'exclud', 'calcul', 'natur', 'increas', 'refer', 'balanc', 'live', 'birth', 'death', 'year', 'it', 'number', 'inflow', 'country/territori', 'less', 'number', 'outflow', 'specifi', 'period', 'overal', 'depend', 'ratio', 'refer', 'number', 'person', 'age', 'age', 'per', 'person', 'age', 'popul', 'by-census', 'statist', 'project', 'collect', 'basic', 'popul', 'inform', 'sampl', 'person', 'popul', 'country/territori', 'a', 'by-census', 'usual', 'conduct', 'two', 'census', 'updat', 'inform', 'obtain', 'last', 'popul', 'census', 'popul', 'census', 'refer', 'large-scal', 'statist', 'project', 'collect', 'basic', 'popul', 'inform', 'person', 'popul', 'country/territori', 'usual', 'popul', 'census', 'conduct', 'everi', 'ten', 'year', 'popul', 'growth', 'rate', 'refer', 'popul', 'chang', 'period', 'percentag', 'popul', 'begin', 'period', 'a', 'regist', 'marriag', 'defin', 'voluntari', 'union', 'life', 'one', 'man', 'one', 'woman', 'exclus', 'other', 'contract', 'accord', 'marriag', 'ordin', 're-registr', 'coupl', 'either', 'customarili', 'marri', 'hong', 'kong', 'marriag', 'reform', 'ordin', 'enact', 'octob', 'marri', 'outsid', 'hong', 'kong', 'also', 'cover', 'statist', 'marriag', 'statist', 'restrict', 'regist', 'marriag', 'it', 'refer', 'percentag', 'month', 'household', 'incom', 'paid', 'month', 'household', 'rent', 'domest', 'household', 'rent', 'accommod', 'occupi', 'all', 'zero', 'incom', 'household', 'and/or', 'zero', 'rent', 'household', 'exclud', 'calcul', 'the', 'practic', 'definit', 'resid', 'popul', 'adopt', 'vari', 'place', 
'place', 'resid', 'mobil', 'pattern', 'uniqu', 'place', 'need', 'given', 'adequ', 'consider', 'in', 'case', 'hong', 'kong', 'resid', 'popul', 'hong', 'kong', 'refer', 'hong', 'kong', 'resid', 'popul', 'defin', 'includ', 'usual', 'resid', 'mobil', 'resid', 'sex', 'ratio', 'refer', 'ratio', 'number', 'male', 'per', 'femal', 'total', 'fertil', 'rate', 'refer', 'averag', 'number', 'children', 'would', 'born', 'aliv', 'women', 'lifetim', 'pass', 'childbear', 'age', 'experienc', 'age', 'specif', 'fertil', 'rate', 'prevail', 'given', 'year', 'the', 'hong', 'kong', 'resid', 'popul', 'compris', 'usual', 'resid', 'mobil', 'resid', 'usual', 'resid', 'refer', 'two', 'categori', 'peopl', 'hong', 'kong', 'perman', 'resid', 'stay', 'hong', 'kong', 'least', 'month', 'month', 'least', 'month', 'month', 'refer', 'time-point', 'regardless', 'whether', 'hong', 'kong', 'refer', 'time-point', 'b', 'hong', 'kong', 'non-perman', 'resid', 'hong', 'kong', 'refer', 'time-point', 'for', 'hong', 'kong', 'perman', 'resid', 'usual', 'resid', 'classifi', 'mobil', 'resid', 'stay', 'hong', 'kong', 'least', 'month', 'less', 'month', 'month', 'least', 'month', 'less', 'month', 'month', 'refer', 'time-point', 'regardless', 'whether', 'hong', 'kong', 'refer', 'time-point', 'under', 'resid', 'popul', 'approach', 'visitor', 'includ', 'hong', 'kong', 'popul']
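The filtering step inside `tokenize_and_stem` can be illustrated without NLTK. The sketch below is only a stand-in under stated assumptions: it splits on whitespace instead of calling `nltk.word_tokenize`, and `TOY_STOPWORDS` and `filter_tokens` are hypothetical names, with a three-word set in place of NLTK's English list. It shows why a capitalized word such as "Under" survives a case-sensitive stop-word check, which is consistent with 'under', 'a' and 'the' appearing in the stemmed output above.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list
TOY_STOPWORDS = {"a", "the", "under"}

def filter_tokens(text):
    """Keep tokens that contain a letter and are not (case-sensitively) stop words."""
    tokens = text.split()  # assumption: whitespace split instead of nltk.word_tokenize
    return [t for t in tokens
            if t not in TOY_STOPWORDS and re.search("[a-zA-Z]", t)]

kept = filter_tokens("Under a de facto concept the population in 2016")
print(kept)
```

Lowercasing each token before the stop-word check would remove such words, at the cost of changing the token list shown above.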
    # Import TfidfVectorizer to create TF-IDF vectors
    from sklearn.feature_extraction.text import TfidfVectorizer
    # Instantiate TfidfVectorizer with the custom tokenizer, English stop words,
    # document-frequency limits (min_df, max_df) and 1- to 3-word n-grams
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                       min_df=0.2, stop_words='english',
                                       use_idf=True, tokenizer=tokenize_and_stem,
                                       ngram_range=(1,3))
    # Fit and transform the tfidf_vectorizer on the merged definitions of each subject
    # to create a TF-IDF vector representation of every subject
    tfidf_matrix = tfidf_vectorizer.fit_transform([x for x in merged_series])
    print(tfidf_matrix.shape)
  1. C:\Users\Toby\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:301: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['afterward', 'alon', 'alreadi', 'alway', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becom', 'besid', 'cri', 'describ', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'otherwis', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev'] not in stop_words.
  2. 'stop_words.' % sorted(inconsistent))
  3. (33, 805)
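The `min_df` and `max_df` arguments prune the vocabulary by document frequency. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation: `document_frequency_filter` and the toy corpus are hypothetical, and the sketch ignores n-grams, `max_features` and the TF-IDF weighting itself.

```python
from collections import Counter

def document_frequency_filter(docs, min_df, max_df):
    """Return terms whose document count lies within [min_df, max_df] fractions of the corpus."""
    n_docs = len(docs)
    # Count each term at most once per document
    df = Counter(term for doc in docs for term in set(doc))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)

# Toy corpus of already-tokenized documents, made up for illustration
toy_docs = [
    ["popul", "rate", "year"],
    ["popul", "trade", "year"],
    ["popul", "price", "year"],
    ["popul", "wage", "rate"],
    ["popul", "trade", "index"],
]
vocab = document_frequency_filter(toy_docs, min_df=0.4, max_df=0.8)
print(vocab)
```

Here "popul" is dropped because it occurs in every document, and terms seen in only one document are dropped as too rare. Applied to the 33 subject documents with `min_df=0.2` and `max_df=0.8`, the same rule would keep terms that occur in roughly 7 to 26 subjects, assuming scikit-learn scales float thresholds by the number of documents in this way.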

Import KMeans and create clusters

    # Import KMeans to perform clustering
    from sklearn.cluster import KMeans
    # All subject IDs could be grouped into 17 categories (according to subject_id 0) and 1 miscellaneous
    km = KMeans(n_clusters=17)
    # Fit the k-means object with tfidf_matrix
    km.fit(tfidf_matrix)
    clusters = km.labels_.tolist()
    print(clusters)
  1. [1, 8, 6, 5, 3, 14, 7, 10, 1, 1, 15, 1, 1, 2, 8, 6, 4, 5, 5, 3, 12, 3, 0, 13, 14, 14, 0, 0, 11, 8, 9, 15, 16]
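The label list is easier to read once it is grouped by cluster. The sketch below pairs the printed labels with the subject IDs in the order they appear in the `merged_series` output; the variable names `subject_ids` and `members` are introduced here for illustration only.

```python
from collections import defaultdict

# Subject IDs in merged_series order, and the cluster labels printed above
subject_ids = [20, 30, 40, 50, 60, 70, 120, 130, 150, 160, 170, 180, 190, 200, 210,
               230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360,
               452, 454, 459, 461]
clusters = [1, 8, 6, 5, 3, 14, 7, 10, 1, 1, 15, 1, 1, 2, 8, 6, 4, 5, 5, 3, 12, 3,
            0, 13, 14, 14, 0, 0, 11, 8, 9, 15, 16]

# Collect the subject IDs that share each cluster label
members = defaultdict(list)
for subject_id, label in zip(subject_ids, clusters):
    members[label].append(subject_id)

for label in sorted(members):
    print(label, members[label])
```

In this run, cluster 1 holds subjects 20, 150, 160, 180 and 190, which `subject_id_dict` names as Population, Population Estimates, Demographics, Gender and Population Projections. The label numbers themselves are arbitrary and may differ between runs, since k-means depends on its initialization.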
    # Import cosine_similarity to calculate the similarity between subjects
    from sklearn.metrics.pairwise import cosine_similarity
    # Convert cosine similarity into a distance (0 means the vectors point the same way)
    similarity_distance = 1 - cosine_similarity(tfidf_matrix)
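For a single pair of vectors, the quantity stored in `similarity_distance` can be written out by hand. This is a small pure-Python sketch with made-up vectors, assuming non-zero inputs of equal length; `cosine_distance` is a hypothetical helper, not part of scikit-learn.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors, made up for illustration
same_direction = cosine_distance([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction give a distance of about 0 regardless of their length, while vectors with no shared terms give 1. That insensitivity to length is why the measure suits subjects whose merged definitions differ greatly in size.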
    # Import matplotlib.pyplot for plotting graphs
    import matplotlib.pyplot as plt
    # Configure matplotlib to display the output inline
    %matplotlib inline
    # Import modules necessary to plot dendrogram
    from scipy.cluster.hierarchy import linkage, dendrogram
    # Create mergings matrix
    # Note: linkage treats a square 2-D array as observation vectors, so each row of
    # similarity_distance is used as a feature vector, not as precomputed distances
    mergings = linkage(similarity_distance, method='complete')
    # Plot the dendrogram, using the English subject names as labels
    dendrogram_ = dendrogram(mergings,
                             labels=[subject_id_dict[x] for x in merged_series.index],
                             leaf_rotation=90,
                             leaf_font_size=16,
                             )
    # Adjust the plot
    fig = plt.gcf()
    # Color the x-axis tick labels red
    for lbl in plt.gca().get_xmajorticklabels():
        lbl.set_color('r')
    fig.set_size_inches(108, 21)
    # Show the plotted dendrogram
    plt.show()

[Figure: dendrogram of the subjects]

    # Test how many clusters should be used:
    # choose the k at which the drop in inertia starts to level off
    ks = range(10, 30)
    inertias = []
    for k in ks:
        # Create a KMeans instance with k clusters: model
        model = KMeans(n_clusters=k)
        # Fit model to samples
        model.fit(tfidf_matrix)
        # Append the inertia to the list of inertias
        inertias.append(model.inertia_)
    # Plot ks vs inertias
    plt.plot(ks, inertias, '-o')
    plt.xlabel('number of clusters, k')
    plt.ylabel('inertia')
    plt.xticks(ks)
    plt.show()

[Figure: inertia versus number of clusters, k]

    # The best number of clusters should be around 20, where the inertia curve starts to flatten
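Reading the elbow off a plot is subjective, so one rough way to cross-check it is to look for the k where the inertia curve bends most sharply. The sketch below is a heuristic under stated assumptions, not a method used in the notebook: `elbow_by_second_difference` is a hypothetical helper, it assumes evenly spaced values of k, and the inertia values are invented for illustration.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k where the inertia curve bends most sharply (largest second difference)."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Drop before k minus drop after k: large when the curve flattens right after k
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Invented inertia values for illustration only; these are not the notebook's results
toy_ks = [2, 3, 4, 5, 6]
toy_inertias = [100.0, 60.0, 30.0, 25.0, 22.0]
elbow = elbow_by_second_difference(toy_ks, toy_inertias)
print(elbow)
```

Because each `KMeans` fit starts from a random initialization, the real inertia curve can be noisy, so a result from this kind of heuristic is best treated as a hint to compare against the plot rather than as a definitive choice of k.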