Handling very large-scale data with sklearn
Handling very large-scale data (datasets too big to fit in a single machine's memory or on its disk) is one of the problems you may well run into today, and it is of course not a challenge unique to sklearn. sklearn's answer is incremental learning: a subset of its models implement the partial_fit interface, which lets you train on the data one batch at a time (a usage sketch follows the list below).
The models that support incremental learning are:
- Classification
    - sklearn.naive_bayes.MultinomialNB
    - sklearn.naive_bayes.BernoulliNB
    - sklearn.linear_model.Perceptron
    - sklearn.linear_model.SGDClassifier
    - sklearn.linear_model.PassiveAggressiveClassifier
    - sklearn.neural_network.MLPClassifier
- Regression
    - sklearn.linear_model.SGDRegressor
    - sklearn.linear_model.PassiveAggressiveRegressor
    - sklearn.neural_network.MLPRegressor
- Clustering
    - sklearn.cluster.MiniBatchKMeans
    - sklearn.cluster.Birch
- Decomposition / Feature Extraction
    - sklearn.decomposition.MiniBatchDictionaryLearning
    - sklearn.decomposition.IncrementalPCA
    - sklearn.decomposition.LatentDirichletAllocation
- Preprocessing
    - sklearn.preprocessing.StandardScaler
    - sklearn.preprocessing.MinMaxScaler
    - sklearn.preprocessing.MaxAbsScaler
As the list shows, the estimators able to handle large-scale data are mainly relatively simple models, chiefly linear ones.
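
As a concrete illustration, here is a minimal out-of-core training sketch that chains a partial_fit-capable preprocessor (StandardScaler) with a partial_fit-capable classifier (SGDClassifier). The batch generator `iter_batches` is hypothetical, standing in for whatever produces your chunks in practice (for example pandas.read_csv(..., chunksize=...) or a database cursor). Note that classifiers require the full list of classes on the first partial_fit call, since any single batch may not contain every class.

```python
# A minimal out-of-core training sketch, assuming data arrives in
# batches that never need to coexist in memory.  `iter_batches` is a
# hypothetical stand-in for whatever produces your chunks.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

def iter_batches(n_batches=10, batch_size=1000, n_features=20):
    """Simulate a stream of (X, y) chunks."""
    for _ in range(n_batches):
        X = rng.randn(batch_size, n_features)
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
        yield X, y

scaler = StandardScaler()
clf = SGDClassifier()
all_classes = np.array([0, 1])  # the first partial_fit call must see every class

for X_batch, y_batch in iter_batches():
    scaler.partial_fit(X_batch)           # update running mean/variance
    X_scaled = scaler.transform(X_batch)  # scale with the statistics so far
    clf.partial_fit(X_scaled, y_batch, classes=all_classes)

# Score on one extra simulated batch held out from training.
X_test, y_test = next(iter_batches(n_batches=1))
print("held-out accuracy: %.3f" % clf.score(scaler.transform(X_test), y_test))
```

One caveat with streaming the scaler this way is that early batches are transformed with less mature statistics; a common alternative is to make one cheap first pass over the data solely to partial_fit the scaler, then a second pass to train the model.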