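Machine learning (ML) has brought significant changes to many fields in both academia and industry. ML plays a growing role in a wide range of applications, such as image and speech recognition, pattern recognition, optimization, natural language processing, and recommendation.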

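"Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort." - Arthur Samuel, 1959.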

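Machine learning can be divided into four main techniques: regression, classification, clustering, and reinforcement learning. These techniques solve problems of different natures in two main forms: supervised and unsupervised learning. Supervised learning requires the data to be labeled and prepared before a model is trained. Unsupervised learning is suited to data that is unlabeled or whose characteristics are unknown. This article does not describe ML concepts or go deep into the terminology of the field. If you are completely new, have a look at my previous article to start your ML learning journey.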

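Machine Learning Libraries in Java

Here is a list of well-known ML libraries in Java. We will describe them one by one and give practical examples using some of these frameworks.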

Weka

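Weka is an open-source library developed at the University of Waikato in New Zealand. It is written in Java and is well known for general-purpose machine learning. Weka provides its own data file format, called ARFF. An ARFF file has two parts: a header and the actual data. The header describes the attributes and their data types.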
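To make the two-part ARFF layout concrete, here is a minimal sketch in plain Java (no Weka dependency) that assembles a tiny ARFF document. The class name `ArffSketch`, the relation name, and the two data rows are illustrative assumptions; only the `@relation`, `@attribute`, and `@data` keywords come from the ARFF format itself.

```java
import java.util.List;

public class ArffSketch {

    // Builds a tiny ARFF document: a header (relation + attributes) followed by the data rows.
    public static String buildArff(String relation, List<String> attributeDecls, List<String> rows) {
        StringBuilder sb = new StringBuilder();
        // Header part: the relation name and one declaration per attribute
        sb.append("@relation ").append(relation).append("\n\n");
        for (String decl : attributeDecls) {
            sb.append("@attribute ").append(decl).append('\n');
        }
        // Data part: one comma-separated row per instance
        sb.append("\n@data\n");
        for (String row : rows) {
            sb.append(row).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical subset of attributes and made-up rows, for illustration only
        String arff = buildArff(
                "diabetes_sketch",
                List.of("plas numeric", "mass numeric", "class {tested_negative,tested_positive}"),
                List.of("148,33.6,tested_positive", "85,26.6,tested_negative"));
        System.out.print(arff);
    }
}
```

The last attribute is declared as a nominal type with its allowed values in braces, which is how a class label is typically expressed in the header.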

Apache Mahout

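Apache Mahout provides a scalable machine learning library. Mahout uses the MapReduce paradigm and can be used for classification, collaborative filtering, and clustering. It relies on Apache Hadoop to run multiple tasks in parallel. Besides classification and clustering, Mahout offers recommendation algorithms such as collaborative filtering, which helps build models quickly and at scale.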

Deeplearning4j

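Deeplearning4j is another Java library, this one focused on deep learning. It is one of the open-source deep learning libraries available for Java. It is written in Scala and Java and can be integrated with Hadoop and Spark for high processing capacity. The current release is a beta version, but it ships with excellent documentation and quick-start examples.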

Mallet

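Mallet stands for MAchine Learning for LanguagE Toolkit. It is one of the few specialized toolkits for natural language processing. It provides capabilities for topic modeling, document classification, clustering, and information extraction. With Mallet, we can process text documents using ML models.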

Spark MLlib

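Spark is a well-known framework for speeding up the processing of large volumes of data. Spark MLlib adds powerful algorithms that run on Spark and plug into Hadoop workflows.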

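Encog Machine Learning Framework

Encog is a Java and C# framework for machine learning. It provides libraries for building support vector machines (SVM), neural networks (NN), Bayesian networks, hidden Markov models (HMM), and genetic algorithms. Encog started as a research project and has received nearly a thousand citations on Google Scholar.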

MOA

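Massive Online Analysis (MOA) provides algorithms for classification, regression, clustering, and recommendation. It also provides libraries for outlier detection and drift detection. It is designed for real-time processing of streaming data.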

Weka Example:

We will use a small diabetes dataset. We first load the data with Weka:

import java.util.logging.Logger;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Main {

    // Any logger exposing info(String) works here; java.util.logging avoids an extra dependency
    private static final Logger log = Logger.getLogger(Main.class.getName());

    public static void main(String[] args) throws Exception {
        // Specifying the datasource
        DataSource dataSource = new DataSource("data.arff");
        // Loading the dataset
        Instances dataInstances = dataSource.getDataSet();
        // Displaying the number of instances
        log.info("The number of loaded instances is: " + dataInstances.numInstances());

        log.info("data:" + dataInstances.toString());
    }
}

The dataset contains 768 instances. Let's see how to get the number of attributes (features), which should be 9.

log.info("The number of attributes in the dataset: " + dataInstances.numAttributes());

Before building any model, we want to identify which column is the target column and see how many classes are found in it:

// Identifying the label index
dataInstances.setClassIndex(dataInstances.numAttributes() - 1);
// Getting the number of classes
log.info("The number of classes: " + dataInstances.numClasses());

With the dataset loaded and the target attribute identified, it is time to build the model. Let's create a simple tree classifier, J48.

// Creating a decision tree classifier (requires: import weka.classifiers.trees.J48;)
J48 treeClassifier = new J48();
treeClassifier.setOptions(new String[] { "-U" });
treeClassifier.buildClassifier(dataInstances);

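In the three lines above, we set an option that requests an unpruned tree and passed the data instances for model training. If we print the tree structure of the generated model after training, we can follow how the model built its rules internally: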

plas <= 127
|   mass <= 26.4
|   |   preg <= 7: tested_negative (117.0/1.0)
|   |   preg > 7
|   |   |   mass <= 0: tested_positive (2.0)
|   |   |   mass > 0: tested_negative (13.0)
|   mass > 26.4
|   |   age <= 28: tested_negative (180.0/22.0)
|   |   age > 28
|   |   |   plas <= 99: tested_negative (55.0/10.0)
|   |   |   plas > 99
|   |   |   |   pedi <= 0.56: tested_negative (84.0/34.0)
|   |   |   |   pedi > 0.56
|   |   |   |   |   preg <= 6
|   |   |   |   |   |   age <= 30: tested_positive (4.0)
|   |   |   |   |   |   age > 30
|   |   |   |   |   |   |   age <= 34: tested_negative (7.0/1.0)
|   |   |   |   |   |   |   age > 34
|   |   |   |   |   |   |   |   mass <= 33.1: tested_positive (6.0)
|   |   |   |   |   |   |   |   mass > 33.1: tested_negative (4.0/1.0)
|   |   |   |   |   preg > 6: tested_positive (13.0)
plas > 127
|   mass <= 29.9
|   |   plas <= 145: tested_negative (41.0/6.0)
|   |   plas > 145
|   |   |   age <= 25: tested_negative (4.0)
|   |   |   age > 25
|   |   |   |   age <= 61
|   |   |   |   |   mass <= 27.1: tested_positive (12.0/1.0)
|   |   |   |   |   mass > 27.1
|   |   |   |   |   |   pres <= 82
|   |   |   |   |   |   |   pedi <= 0.396: tested_positive (8.0/1.0)
|   |   |   |   |   |   |   pedi > 0.396: tested_negative (3.0)
|   |   |   |   |   |   pres > 82: tested_negative (4.0)
|   |   |   |   age > 61: tested_negative (4.0)
|   mass > 29.9
|   |   plas <= 157
|   |   |   pres <= 61: tested_positive (15.0/1.0)
|   |   |   pres > 61
|   |   |   |   age <= 30: tested_negative (40.0/13.0)
|   |   |   |   age > 30: tested_positive (60.0/17.0)
|   |   plas > 157: tested_positive (92.0/12.0)

Number of Leaves  :  22

Size of the tree :  43
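Each leaf in the printed tree ends with a pair such as (117.0/1.0): the first number is how many training instances reach that leaf, and the second, when present, is how many of them the leaf misclassifies. The sketch below is plain Java, independent of Weka, and tallies those numbers from a few lines copied from the tree. The class name `LeafCountSketch` and its `tally` helper are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeafCountSketch {

    // Matches the trailing "(reached)" or "(reached/misclassified)" annotation of a leaf line
    private static final Pattern LEAF =
            Pattern.compile("\\((\\d+(?:\\.\\d+)?)(?:/(\\d+(?:\\.\\d+)?))?\\)\\s*$");

    // Returns {instances reaching the leaves, instances misclassified at the leaves}
    public static double[] tally(String[] treeLines) {
        double reached = 0;
        double misclassified = 0;
        for (String line : treeLines) {
            Matcher m = LEAF.matcher(line);
            if (m.find()) { // inner nodes carry no annotation and are skipped
                reached += Double.parseDouble(m.group(1));
                if (m.group(2) != null) {
                    misclassified += Double.parseDouble(m.group(2));
                }
            }
        }
        return new double[] { reached, misclassified };
    }

    public static void main(String[] args) {
        // A few lines copied from the tree printed above
        String[] excerpt = {
            "|   |   preg <= 7: tested_negative (117.0/1.0)",
            "|   |   preg > 7",
            "|   |   |   mass <= 0: tested_positive (2.0)",
            "|   |   age <= 28: tested_negative (180.0/22.0)"
        };
        double[] counts = tally(excerpt);
        System.out.println("Reached: " + counts[0] + ", misclassified: " + counts[1]);
    }
}
```

Applied to all 22 leaves of the tree, the same tally gives 768 instances reached and 120 misclassified. Since these are counts on the training data itself, they say little about how the tree would perform on unseen data.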

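Deeplearning4j Example:

This example builds a convolutional neural network (CNN) model to classify the MNIST dataset. If you are not familiar with MNIST or with how a CNN classifies handwritten digits, see my earlier post, which covers these aspects in detail.

As usual, we load the dataset and display its size.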

DataSetIterator MNISTTrain = new MnistDataSetIterator(batchSize,true,seed);
DataSetIterator MNISTTest = new MnistDataSetIterator(batchSize,false,seed);

Let's double-check that we get ten unique labels from the dataset:

log.info("The number of total labels found in the training dataset " + MNISTTrain.totalOutcomes());
log.info("The number of total labels found in the test dataset " + MNISTTest.totalOutcomes());

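Next, let's configure the architecture of the model. We will use two convolution layers, each followed by a max-pooling layer, then a dense layer and the output layer. Deeplearning4j offers several options for the weight initialization scheme.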

// Building the CNN model
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(seed) // random seed
        .l2(0.0005) // regularization
        .weightInit(WeightInit.XAVIER) // initialization of the weight scheme
        .updater(new Adam(1e-3)) // Setting the optimization algorithm
        .list()
        .layer(new ConvolutionLayer.Builder(5, 5)
                //Setting the stride, the kernel size, and the activation function.
                .nIn(nChannels)
                .stride(1,1)
                .nOut(20)
                .activation(Activation.IDENTITY)
                .build())
        .layer(new SubsamplingLayer.Builder(PoolingType.MAX) // downsampling the convolution
                .kernelSize(2,2)
                .stride(2,2)
                .build())
        .layer(new ConvolutionLayer.Builder(5, 5)
                // Setting the stride, kernel size, and the activation function.
                .stride(1,1)
                .nOut(50)
                .activation(Activation.IDENTITY)
                .build())
        .layer(new SubsamplingLayer.Builder(PoolingType.MAX) // downsampling the convolution
                .kernelSize(2,2)
                .stride(2,2)
                .build())
        .layer(new DenseLayer.Builder().activation(Activation.RELU)
                .nOut(500).build())
        .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nOut(outputNum)
                .activation(Activation.SOFTMAX)
                .build())
        // the input images are 28x28 with a depth of 1, supplied as flattened rows.
        .setInputType(InputType.convolutionalFlat(28,28,1))
        .build();
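It can help to trace how the 28x28 input shrinks as it passes through the layers configured above. The sketch below is plain Java and assumes no padding is applied, which the configuration does not set explicitly. The class name `ConvShapeSketch` and its helper are hypothetical, and the kernel sizes, strides, and channel count are copied from the configuration.

```java
public class ConvShapeSketch {

    // Output size along one dimension for a convolution or pooling layer without padding
    public static int outputSize(int inputSize, int kernelSize, int stride) {
        return (inputSize - kernelSize) / stride + 1;
    }

    public static void main(String[] args) {
        int size = 28;                  // MNIST images are 28x28
        size = outputSize(size, 5, 1);  // first 5x5 convolution, stride 1
        size = outputSize(size, 2, 2);  // first 2x2 max pooling, stride 2
        size = outputSize(size, 5, 1);  // second 5x5 convolution, stride 1
        size = outputSize(size, 2, 2);  // second 2x2 max pooling, stride 2
        int channels = 50;              // nOut of the second convolution layer
        int denseInputs = size * size * channels;
        System.out.println("Feature map: " + size + "x" + size + "x" + channels);
        System.out.println("Inputs to the dense layer: " + denseInputs);
    }
}
```

Under that assumption the spatial size goes 28, 24, 12, 8, 4, so the dense layer receives 4 x 4 x 50 = 800 values. The configuration itself never states this number, because setInputType lets the library derive the layer input sizes.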

Once the architecture is set, we need to initialize the model, supply the training dataset, and trigger the training.

MultiLayerNetwork model = new MultiLayerNetwork(conf);
// initialize the model weights.
model.init();

log.info("Step2: start training the model");
// Log the score every 10 iterations and evaluate on the test set at the end of every epoch
model.setListeners(new ScoreIterationListener(10), new EvaluativeListener(MNISTTest, 1, InvocationType.EPOCH_END));
// Training the model
model.fit(MNISTTrain, nEpochs);

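During training, the listeners report progress: the score is logged at regular intervals, and the evaluation on the test set includes a confusion matrix of the classification results. Let's look at the accuracy after ten epochs of training: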

=========================Confusion Matrix=========================
    0    1    2    3    4    5    6    7    8    9
---------------------------------------------------
  977    0    0    0    0    0    1    1    1    0 | 0 = 0
    0 1131    0    1    0    1    2    0    0    0 | 1 = 1
    1    2 1019    3    0    0    0    3    4    0 | 2 = 2
    0    0    1 1004    0    1    0    1    3    0 | 3 = 3
    0    0    0    0  977    0    2    0    1    2 | 4 = 4
    1    0    0    9    0  879    1    0    1    1 | 5 = 5
    4    2    0    0    1    1  949    0    1    0 | 6 = 6
    0    4    2    1    1    0    0 1018    1    1 | 7 = 7
    2    0    3    1    0    1    1    2  962    2 | 8 = 8
    0    2    0    2   11    2    0    3    2  987 | 9 = 9
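The overall accuracy can be read off the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the sum of all cells. Here is a small sketch in plain Java using the numbers printed above, where each row corresponds to the actual digit. The class name `ConfusionMatrixSketch` is an illustrative assumption and is not part of Deeplearning4j.

```java
public class ConfusionMatrixSketch {

    // Confusion matrix copied from the training output above (rows: actual digit, columns: predicted digit)
    public static final int[][] MATRIX = {
        { 977,    0,    0,    0,   0,   0,   1,    1,   1,   0 },
        {   0, 1131,    0,    1,   0,   1,   2,    0,   0,   0 },
        {   1,    2, 1019,    3,   0,   0,   0,    3,   4,   0 },
        {   0,    0,    1, 1004,   0,   1,   0,    1,   3,   0 },
        {   0,    0,    0,    0, 977,   0,   2,    0,   1,   2 },
        {   1,    0,    0,    9,   0, 879,   1,    0,   1,   1 },
        {   4,    2,    0,    0,   1,   1, 949,    0,   1,   0 },
        {   0,    4,    2,    1,   1,   0,   0, 1018,   1,   1 },
        {   2,    0,    3,    1,   0,   1,   1,    2, 962,   2 },
        {   0,    2,    0,    2,  11,   2,   0,    3,   2, 987 }
    };

    // Overall accuracy = correctly classified samples (the diagonal) / all samples
    public static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int row = 0; row < matrix.length; row++) {
            for (int col = 0; col < matrix[row].length; col++) {
                total += matrix[row][col];
                if (row == col) {
                    correct += matrix[row][col];
                }
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        System.out.println("Accuracy: " + accuracy(MATRIX)); // prints "Accuracy: 0.9903"
    }
}
```

The diagonal sums to 9903 out of 10000 test images, an accuracy of 99.03%. The largest single confusion in this run is the digit 9 being predicted as 4 (11 cases).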

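Mallet Example:

As mentioned earlier, Mallet is a powerful toolkit for natural language modeling. We will use a sample corpus provided by David Blei, which ships with the Mallet package. Mallet has dedicated libraries for annotating text tokens for classification. Before loading the dataset, note that Mallet works with the concept of a pipeline: you define your pipeline first and then supply the dataset to pass through it.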

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

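The pipeline is defined as an ArrayList that holds the typical steps we always perform before building a topic model. Every text in a document passes through the following steps:

Lowercase the keywords
Tokenize the text
Remove stop words
Map to features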

pipeList.add( new CharSequenceLowercase() );
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
// Setting the dictionary of the stop words
URL stopWordsFile = getClass().getClassLoader().getResource("stoplists/en.txt");
pipeList.add( new TokenSequenceRemoveStopwords(new File(stopWordsFile.toURI()), "UTF-8", false, false, false) );

pipeList.add( new TokenSequence2FeatureSequence() );
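To see what the first three pipeline steps do to a sentence, here is a rough stand-in written in plain Java rather than with Mallet's pipe classes. It reuses the token pattern from the pipeline above. The class name `TokenPipelineSketch`, the sample sentence, and the one-word stop list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenPipelineSketch {

    // Same token pattern as in the pipeline: a letter, then letters or punctuation, then a letter
    private static final Pattern TOKEN = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // Lowercases the text, extracts tokens, and drops the stop words
    public static List<String> process(String text, Set<String> stopWords) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(lower);
        while (matcher.find()) {
            String token = matcher.group();
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical input sentence and stop list
        List<String> tokens = process("The state-of-the-art model is in Java", Set.of("the"));
        System.out.println(tokens); // prints [state-of-the-art, model, java]
    }
}
```

Note that the pattern needs at least three characters, so short words such as "is" and "in" never become tokens, while hyphenated words stay in one piece because punctuation is allowed inside a token.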

With the pipeline defined, we pass in the instances that represent the raw text of each document.

InstanceList instances = new InstanceList (new SerialPipes(pipeList));

Now comes the step of passing in the input file to populate the instance list.

URL inputFileURL = getClass().getClassLoader().getResource(inputFile);
Reader fileReader = new InputStreamReader(new FileInputStream(new File(inputFileURL.toURI())), "UTF-8");
instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
        3, 2, 1)); // data, label, name fields
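The regular expression passed to CsvIterator splits each line into three capture groups, and the arguments 3, 2, 1 say which group holds the data, the label, and the name. The following plain-Java sketch applies the same expression to a single line. The class name `CsvLineSketch` and the tab-separated sample line are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvLineSketch {

    // Same expression as in the CsvIterator call: name, label, then the rest of the line
    private static final Pattern LINE = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Returns {name, label, data} for one input line
    public static String[] split(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Line does not match the expected layout: " + line);
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical line: name, label, and document text separated by tabs
        String[] fields = split("doc1\tnews\tThe quick brown fox");
        System.out.println("name=" + fields[0] + ", label=" + fields[1] + ", data=" + fields[2]);
    }
}
```

One detail worth knowing: `\S*` matches any non-whitespace character, commas included, so a line whose fields are separated by commas without surrounding whitespace would not be split the way the group numbers suggest.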

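As the previous statement shows, we provide instructions on how the CSV file is structured. The source file is in the resources folder and contains about two thousand lines. Each line holds the raw text of one document and is split into three fields (name, label, and document content), which the regular expression allows to be separated by commas or whitespace. We can print the number of instances found in the input file with the following statement: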

log.info("The number of instances found in the input file is: " + instances.size());

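Now let's model the topics of the documents. Suppose we have 100 different topics in these 2k documents. Mallet lets us set two parameters: the alpha and beta weights. Alpha controls the concentration of the document-topic distributions, while beta is the prior weight of each word in the topic-word distributions. In the constructor call that follows, the second argument is the sum of alpha over all topics and the third is beta.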


int numTopics = 100;
// defining the model
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
// adding the instances to the model
model.addInstances(instances);

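The model chosen in this example is an implementation of LDA (Latent Dirichlet Allocation). The algorithm groups documents into topics based on similarities among the observed keywords.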

I like how Mallet's API makes parallel processing easy to set up. Here, we can use multiple threads, each working on a subsample.

model.setNumThreads(2);

Only two things are left: setting the number of training iterations and starting the training.

model.setNumIterations(50);
model.estimate();

The full example on my GitHub includes more details on how to display the results of the topic modeling.

Further Reading

[Book] Machine Learning in Java, by Boštjan Kaluža, published by O'Reilly.
[Book] Machine Learning: End-to-End guide for Java developers, by Richard M. Reese, Jennifer L. Reese, Bostjan Kaluza, Dr. Uday Kamath, and Krishna Choppella.
[Tutorial] Spark MLlib examples.
[Tutorial] Machine learning with Mallet.

All the examples provided in this article are available on my GitHub.

Translated from: https://towardsdatascience.com/machine-learning-in-java-e335b9d80c14

Machine Learning in Java