pytorch - Why doesn't feature extraction of text return all possible feature names?

<div class="post-text" itemprop="text">
  
    Very good question! Though this isn't really a
     <code>
 pytorch
 </code>
     question but a
     <code>
 sklearn
 </code>
     one =)
  
  
    I encourage you to first go through
    <a href="https://www.kaggle.com/alvations/basic-nlp-with-nltk" rel="nofollow noreferrer">
      https://www.kaggle.com/alvations/basic-nlp-with-nltk
    </a>
    , especially the "
    <strong>
      Vectorization with sklearn
    </strong>
    " section.
  
  <HR />
  <H1>
    TL;DR
  </H1>
  
    If we use the
     <code>
 CountVectorizer
 </code>
    ,
  
   <pre>
 <code>
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer

sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."

with StringIO('\n'.join([sent1, sent2])) as fin:
    # Create the vectorizer
    count_vect = CountVectorizer()
    count_vect.fit_transform(fin)

# We can check the vocabulary in our vectorizer.
# It's a dictionary where the words are the keys and
# the values are the IDs given to each word.
print(count_vect.vocabulary_)

</code>
 </pre>
  
    [out]:
  
   <pre>
 <code>
{'brown': 0,
 'dog': 1,
 'fox': 2,
 'jumps': 3,
 'lazy': 4,
 'mr': 5,
 'over': 6,
 'quick': 7,
 'the': 8}

</code>
 </pre>
  
    <strong>
      We didn't tell the vectorizer to remove punctuation, tokenize, or lowercase, so how did it do all that?
    </strong>
  
  
    Also, "the" is in the vocabulary. It's a stopword, and we want it gone...
And "jumps" isn't stemmed or lemmatized!
  
  
    If we look at the documentation of the CountVectorizer in sklearn, we see:
  
   <pre>
 <code>
CountVectorizer(
    input='content', encoding='utf-8',
    decode_error='strict', strip_accents=None,
    lowercase=True, preprocessor=None,
    tokenizer=None, stop_words=None,
    token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1),
    analyzer='word', max_df=1.0, min_df=1,
    max_features=None, vocabulary=None,
    binary=False, dtype=<class 'numpy.int64'>)

</code>
 </pre>
  
    And more specifically:
  
  <BLOCKQUOTE>
    
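      <strong>
        analyzer
      </strong>
       : string, {'word', 'char', 'char_wb'} or callable
    
    
      Whether the feature should be made of word or character n-grams.
  Option 'char_wb' creates character n-grams only from text inside word
  boundaries; n-grams at the edges of words are padded with space. If a
  callable is passed it is used to extract the sequence of features
  out of the raw, unprocessed input.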
    
    
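      <strong>
        preprocessor
      </strong>
       : callable or None (default)
    
    
      Override the preprocessing (string transformation) stage while
  preserving the tokenizing and n-grams generation steps.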
    
    
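      <strong>
        tokenizer
      </strong>
       : callable or None (default)
    
    
      Override the string tokenization step while preserving the
  preprocessing and n-grams generation steps. Only applies if
  analyzer == 'word'.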
    
    
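      <strong>
        stop_words
      </strong>
       : string {'english'}, list, or None (default)
    
    
      If 'english', a built-in stop word list for English is used. If a
  list, that list is assumed to contain stop words, all of which will be
  removed from the resulting tokens. Only applies if analyzer == 'word'.
  If None, no stop words will be used.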
    
    
      <strong>
        lowercase
      </strong>
       : boolean, True by default
    
    
      Convert all characters to lowercase before tokenizing.
    
  </BLOCKQUOTE>
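    The parameters quoted above describe a small pipeline: preprocess (lowercase), tokenize with a regular expression, then optionally drop stop words. Here is a minimal standard-library sketch of that flow, not sklearn's actual implementation. The analyze helper and the two-word stop list are hypothetical names made up for illustration; only the token pattern is taken from the signature above.

```python
import re

# Default token pattern quoted in the CountVectorizer signature above.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def analyze(text, lowercase=True, stop_words=None):
    """Toy word analyzer (hypothetical helper): preprocess, tokenize, filter."""
    if lowercase:
        text = text.lower()                 # preprocessing step
    tokens = TOKEN_PATTERN.findall(text)    # tokenization step
    if stop_words is not None:              # optional stop-word removal
        tokens = [t for t in tokens if t not in stop_words]
    return tokens


sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."

tokens1 = analyze(sent1)
tokens2 = analyze(sent2)
print(tokens1)
print(tokens2)

# Sorted unique tokens, numbered in order, mirror the vocabulary_ mapping.
vocabulary = {term: idx for idx, term in enumerate(sorted(set(tokens1 + tokens2)))}
print(vocabulary)

# A made-up two-word stop list, NOT sklearn's built-in English list.
filtered = analyze(sent1, stop_words={"the", "over"})
print(filtered)
```

    With the defaults, the punctuation disappears and everything is lowercased, which is why the vocabulary in the TL;DR looks the way it does even though we never asked for any of that explicitly.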
  
    But in the case of the example from
    <a href="http://shop.oreilly.com/product/0636920063445.do" rel="nofollow noreferrer">
      http://shop.oreilly.com/product/0636920063445.do
    </a>
    , it's not really the stop words that are causing the problem.
  
  
    <strong>
      If we explicitly use the English stopwords
    </strong>
     from
    <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py" rel="nofollow noreferrer">
      https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py
    </a>
  
   <pre>
 <code>
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> one_hot_vectorizer = CountVectorizer(stop_words='english')

>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
 lowercase=True, max_df=1.0, max_features=None, min_df=1,
 ngram_range=(1, 1), preprocessor=None, stop_words='english',
 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
 tokenizer=None, vocabulary=None)

>>> one_hot_vectorizer.get_feature_names()
['arrow', 'banana', 'flies', 'fruit', 'like', 'time']

</code>
 </pre>
  
    <strong>
      So what is happening when the
       <code>
 stop_words
 </code>
       argument is left as None?
    </strong>
  
  
    Let's try an experiment where I add some single characters to the input:
  
   <pre>
 <code>
>>> corpus = ['Time flies flies like an arrow 1 2 3.', 'Fruit flies like a banana x y z.']

>>> one_hot_vectorizer = CountVectorizer()

>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
 lowercase=True, max_df=1.0, max_features=None, min_df=1,
 ngram_range=(1, 1), preprocessor=None, stop_words=None,
 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
 tokenizer=None, vocabulary=None)
>>> one_hot_vectorizer.get_feature_names() 
['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']

</code>
 </pre>
  
    They're all gone again!!!
  
  
    Now if we dig into the documentation,
    <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L738" rel="nofollow noreferrer">
      https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L738
    </A>
  
  <BLOCKQUOTE>
    
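      <strong>
        token_pattern
      </strong>
       : string
          Regular expression denoting what constitutes a "token", only used
          if
       <code>
 analyzer == 'word'
 </code>
      . The default regexp selects tokens of 2
          or more alphanumeric characters (punctuation is completely ignored
          and always treated as a token separator).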
    
  </BLOCKQUOTE>
  
    Ah ha, that's why all the single-character tokens got removed!
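    To see this without sklearn in the loop, we can run the default pattern straight through Python's re module. This is only a rough illustration of the tokenization step, under the assumption that lowercasing happens first as described by the lowercase parameter; the variable names are mine.

```python
import re

# The default token_pattern: a word boundary, then at least TWO word characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")

corpus = ['Time flies flies like an arrow 1 2 3.',
          'Fruit flies like a banana x y z.']

# Lowercase first, then tokenize each document with the default pattern.
tokens_per_doc = [default_pattern.findall(doc.lower()) for doc in corpus]

for tokens in tokens_per_doc:
    print(tokens)
```

    Tokens such as "1", "a" and "x" never match, because the pattern requires two or more word characters between the boundaries.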
  
  
    The default pattern for
     <code>
 CountVectorizer
 </code>
     is
     <code>
 token_pattern=r"(?u)\b\w\w+\b"
 </code>
    . To make it accept single characters, you can try:
  
   <pre>
 <code>
>>> one_hot_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
 lowercase=True, max_df=1.0, max_features=None, min_df=1,
 ngram_range=(1, 1), preprocessor=None, stop_words=None,
 strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
 vocabulary=None)
>>> one_hot_vectorizer.get_feature_names()
['1', '2', '3', 'a', 'an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time', 'x', 'y', 'z']

</code>
 </pre>
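    As a quick sanity check, the difference between the two patterns can be reproduced with a toy vocabulary builder. build_vocabulary is a hypothetical helper written for this comparison, not part of sklearn, and it only mimics the lowercase-then-tokenize behaviour described above.

```python
import re

corpus = ['Time flies flies like an arrow 1 2 3.',
          'Fruit flies like a banana x y z.']


def build_vocabulary(docs, token_pattern):
    """Hypothetical helper: sorted unique lowercase tokens matching the pattern."""
    regex = re.compile(token_pattern)
    terms = set()
    for doc in docs:
        terms.update(regex.findall(doc.lower()))
    return sorted(terms)


default_vocab = build_vocabulary(corpus, r"(?u)\b\w\w+\b")  # two or more characters
relaxed_vocab = build_vocabulary(corpus, r"(?u)\b\w+\b")    # one or more characters

print(default_vocab)
print(relaxed_vocab)

# Tokens that only the relaxed pattern keeps.
dropped = sorted(set(relaxed_vocab) - set(default_vocab))
print(dropped)
```

    The relaxed list lines up with the feature names printed above, and the difference is exactly the single-character tokens.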
</DIV>