- What’s the difference between Multimodal Machine Learning and Multi-view Learning? How is each concept defined? Do the two terms refer to the same thing?
- Is the RPSVM-2V solution just a linear integration of RSVM and PSVM-2V, i.e., branching with an “IF” on data completeness and jumping to the corresponding model?
- 0.1 KL(Kernel Methods)
- Support Vector Machines (SVM)
- Radial Basis Function (RBF)
- Linear Discriminant Analysis (LDA)
Kernels, or kernel methods (also called kernel functions), are a family of algorithms used for pattern analysis. They allow a non-linear problem to be solved with a linear classifier by implicitly working in a transformed feature space. Kernel methods are employed in SVMs (Support Vector Machines), which are used for classification and regression problems.
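As a small illustration of the kernel trick (a minimal NumPy sketch, not taken from any particular library), the degree-2 polynomial kernel gives the same value as an explicit inner product in the mapped feature space, without ever constructing that space:

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2, computed without any feature map."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# The kernel evaluates the inner product in feature space without ever forming phi.
print(poly2_kernel(x, z))          # 16.0
print(np.dot(phi(x), phi(z)))      # 16.0 (same value)
```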
Support Vector Machines (SVM)
In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
The SVM decision function with the optimal parameters:
$$
f(x)=\operatorname{sgn}\left(w^{*T}x+b^{*}\right)=\operatorname{sgn}\left(\sum_{n=1}^{N}\lambda_n^{*}y^{n}(x^{n})^{T}x+b^{*}\right)
$$
In a transformed feature space, the SVM decision function becomes:
$$
f(x)=\operatorname{sgn}\left(w^{*T}\phi(x)+b^{*}\right)=\operatorname{sgn}\left(\sum_{n=1}^{N}\lambda_n^{*}y^{n}K(x^{n},x)+b^{*}\right)
$$
where the kernel function is $K(x^{n},x)=\phi(x^{n})^{T}\phi(x)$.
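A minimal NumPy sketch of the kernelized decision function above; the multipliers `lambdas`, labels `y_sv`, support vectors `X_sv`, and bias `b` are assumed to come from an already-trained SVM, and the RBF kernel is just one possible choice:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """RBF kernel K(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def svm_decision(x, X_sv, y_sv, lambdas, b, kernel=rbf_kernel):
    """f(x) = sgn( sum_n lambda_n * y_n * K(x_n, x) + b )."""
    s = sum(l * y * kernel(xn, x) for l, y, xn in zip(lambdas, y_sv, X_sv))
    return np.sign(s + b)
```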
Radial Basis Function (RBF)
A radial basis function (RBF) is a real-valued function φ whose value depends only on the distance between the input and some fixed point, either the origin, so that φ(𝐱)=φ(|𝐱|), or some other fixed point 𝐜, called a center, so that φ(𝐱)=φ(|𝐱-𝐜|). Any function φ that satisfies the property φ(𝐱)=φ(|𝐱|) is a radial function. The distance is usually Euclidean distance, although other metrics are sometimes used.
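A tiny sketch of the radial property using the Gaussian RBF as a concrete example (the function name and the parameter `eps` are our own choices for illustration):

```python
import numpy as np

def gaussian_rbf(x, c, eps=1.0):
    """Gaussian RBF: the value depends only on the distance ||x - c|| to the center c."""
    r = np.linalg.norm(x - c)
    return np.exp(-(eps * r) ** 2)

c = np.zeros(2)
# Two points at the same Euclidean distance from c get the same value (radial symmetry).
print(gaussian_rbf(np.array([1.0, 0.0]), c))   # ~0.3679
print(gaussian_rbf(np.array([0.0, -1.0]), c))  # same value
```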
Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher’s linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
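A small scikit-learn sketch of LDA used for dimensionality reduction and classification; the toy data and parameter choices are made up for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class data: two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Fit LDA and project onto the single discriminant direction (n_classes - 1 = 1).
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)

print(X_1d.shape)        # (100, 1): reduced to one dimension
print(lda.score(X, y))   # training accuracy when LDA is used directly as a classifier
```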
- 0.2 MKL(Multiple Kernel Learning)
- Supervised learning
- Fixed rules approaches
- Heuristic approaches
- Optimization approaches
- Bayesian approaches
- Boosting approaches
- Semi-Supervised learning
- Unsupervised learning
Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.
Multiple kernel learning (MKL) algorithms aim to find the best convex combination of a set of kernels to form the best classifier. Many algorithms have been proposed in recent years; they are commonly grouped by learning setting into supervised, semi-supervised, and unsupervised approaches.
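A minimal sketch of the convex-combination idea: two base kernels with different notions of similarity are mixed with hand-fixed weights `beta`, and the combined Gram matrix is passed to an SVM with a precomputed kernel. A real MKL algorithm would learn `beta` rather than fix it by hand:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = (X[:, 0] + np.sin(X[:, 1]) > 0).astype(int)

# Two base kernels with different notions of similarity.
K_lin = linear_kernel(X)
K_rbf = rbf_kernel(X, gamma=0.5)

# Convex combination K' = beta1 * K1 + beta2 * K2 (fixed here, not learned).
beta = np.array([0.3, 0.7])
K_comb = beta[0] * K_lin + beta[1] * K_rbf

clf = SVC(kernel="precomputed").fit(K_comb, y)
print(clf.score(K_comb, y))   # training accuracy with the combined kernel
```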
Supervised learning
Fixed rules approaches
Heuristic approaches
These algorithms use a combination function that is parameterized; for example, the combined kernel can be defined as a weighted sum of the base kernels, as shown below.
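A common parameterized combination (stated here as the standard weighted sum; the specific heuristics surveyed may use other forms) is

$$
K'(x_i,x_j)=\sum_{m=1}^{P}\beta_m K_m(x_i,x_j),\qquad \beta_m\ge 0,\ \ \sum_{m=1}^{P}\beta_m=1,
$$

where heuristic methods typically set each $\beta_m$ from a quality measure of the corresponding base kernel, such as its accuracy when used alone.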
Other approaches use a definition of kernel similarity, such as the kernel alignment given below.
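One standard such similarity is the empirical kernel alignment of Cristianini et al.:

$$
A(K_1,K_2)=\frac{\langle K_1,K_2\rangle_F}{\sqrt{\langle K_1,K_1\rangle_F\,\langle K_2,K_2\rangle_F}},\qquad
\langle K_1,K_2\rangle_F=\sum_{i,j}K_1(x_i,x_j)\,K_2(x_i,x_j),
$$

so $A(K,YY^{T})$ measures how well a kernel matches the label structure; this is the quantity maximized in the optimization approach below.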
Optimization approaches
These approaches solve an optimization problem to determine parameters for the kernel combination function. This has been done with similarity measures and structural risk minimization approaches. For similarity measures such as the one defined above, the problem can be formulated as follows:[9]
$$
\max_{\beta\,:\,\operatorname{tr}(K'_{\text{tra}})=1,\ K'\geq 0} A\!\left(K'_{\text{tra}},\,YY^{T}\right).
$$
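A small NumPy check of the alignment score between a kernel matrix and the ideal target kernel $YY^{T}$; this sketches only the quantity being maximized, not the optimization over $\beta$ itself:

```python
import numpy as np

def alignment(K1, K2):
    """Empirical alignment A(K1, K2) = <K1,K2>_F / sqrt(<K1,K1>_F <K2,K2>_F)."""
    num = np.sum(K1 * K2)
    return num / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

y = np.array([1, 1, -1, -1], dtype=float)
target = np.outer(y, y)                 # ideal kernel Y Y^T

K_good = np.outer(y, y)                 # perfectly aligned kernel
K_bad = np.eye(4)                       # identity kernel, poorly aligned

print(alignment(K_good, target))        # 1.0
print(alignment(K_bad, target))         # 0.5
```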
Bayesian approaches
Bayesian approaches put priors on the kernel combination parameters and learn the parameter values from the priors together with the base algorithm. The decision function is then typically a kernel expansion over the training points in which each base kernel is weighted by its learned combination parameter.
Boosting approaches
Boosting approaches add new kernels iteratively until some stopping criterion, defined as a function of performance, is reached. An example of this is the MARK model developed by Bennett et al. (2002) [14].
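The following is only a toy greedy sketch of the general idea of adding kernels until a performance-based stopping criterion is met; it is not the MARK algorithm itself, and the kernel list, split indices, and tolerance are assumptions made for illustration:

```python
import numpy as np
from sklearn.svm import SVC

def greedy_kernel_addition(kernels, y, train_idx, val_idx, tol=1e-3):
    """Add base Gram matrices one at a time; keep an addition only if
    validation accuracy improves by more than tol (generic sketch, not MARK)."""
    n = len(y)
    K_sum, best = np.zeros((n, n)), 0.0
    for K in kernels:
        K_try = K_sum + K
        clf = SVC(kernel="precomputed")
        clf.fit(K_try[np.ix_(train_idx, train_idx)], y[train_idx])
        score = clf.score(K_try[np.ix_(val_idx, train_idx)], y[val_idx])
        if score > best + tol:
            K_sum, best = K_try, score   # keep the new kernel
        else:
            break                        # stopping criterion: no improvement
    return K_sum, best

# Intended usage: pass a list of precomputed n x n Gram matrices plus index arrays
# for a train/validation split; the function returns the summed kernel and its score.
```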
Semi-Supervised learning
Semi-supervised MKL approaches exploit both labeled and unlabeled data. The optimization problem typically combines a loss term on the labeled examples, a regularization term on the kernel combination, and a term that encourages the predictions on the unlabeled examples to be consistent under the combined kernel.
Unsupervised learning
Unsupervised MKL approaches learn the kernel combination without any labels. One formulation seeks a combined kernel under which each data point can be well reconstructed from a small set of neighboring points, so the minimization problem combines a reconstruction-error term with a term that keeps each neighborhood small and local.
- 0.3 MVL(Multi-view Learning)
Data in reality often appear in multi-modal forms, called multi-view data. Each view lies in a distinct feature space and describes different attributes of the same object, typically with high dimensionality, strong heterogeneity, and rich descriptive power. Multi-view learning models generally perform more favorably than single-view learning models. Existing MVL approaches comply with either the consensus principle or the complementary principle. Here we mainly review SVM-based MVL classification methods.
- 1.1 Multi-view data and learning
With the help of various kinds of sensing equipment, tracking data about the same object from different aspects has become much easier. In most cases, multi-modal data give a better view of an event, characterizing the object more comprehensively by providing additional information. The simplest way to use such data is to train a model on the concatenated views. However, this approach ignores the correlation and interaction among views, which may cause overfitting and the curse of dimensionality [3,4]. For this reason, multi-view learning (MVL) has emerged and achieved great success in classification [5], clustering [6,7], feature selection [8,9], etc.
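A toy sketch contrasting the naive concatenation baseline with a simple view-wise alternative; the two views, the SVM classifier, and the averaging rule are made-up choices for illustration, not a full MVL method:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 100
X_view1 = rng.normal(size=(n, 10))      # e.g. image features of each sample
X_view2 = rng.normal(size=(n, 30))      # e.g. text features of the same samples
y = (X_view1[:, 0] + X_view2[:, 0] > 0).astype(int)

# Single-view baseline: simply concatenate the two feature spaces.
X_concat = np.hstack([X_view1, X_view2])
clf_concat = SVC().fit(X_concat, y)

# A simple multi-view alternative: one classifier per view, predictions combined
# by averaging the decision values (a crude consensus, not a full MVL method).
clf1, clf2 = SVC().fit(X_view1, y), SVC().fit(X_view2, y)
avg_decision = (clf1.decision_function(X_view1) + clf2.decision_function(X_view2)) / 2
y_pred = (avg_decision > 0).astype(int)
```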
- 1.2 Incomplete multi-view data
In real-world applications, we often face the obstacle that only partial views are available, due to high collection cost, equipment failure, and so on [13]. Thus, learning with multiple incomplete views is a challenging yet valuable task. Unfortunately, the aforementioned strategies may either lose important information or introduce errors, especially when the amount of incomplete multi-view data is particularly large. Meanwhile, learning from massive complete-view data with no missing views is also complicated and time-consuming.
- reduced support vector machine (RSVM)
- multi-view privileged support vector machine (PSVM-2V)
- mixed solution RPSVM-2V (integration of RSVM and PSVM-2V)
RSVM
The RSVM can be expressed as follows:
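As a hedged sketch in the notation of the later constraints, assuming $\overline{A'}$ denotes a small, randomly selected subset of the training points $A'$, a reduced-kernel soft-margin SVM takes the generic form (the exact RSVM objective, e.g. the squared slacks of the original smooth formulation, may differ):

$$
\min_{v,\,b,\,\xi}\ \frac{1}{2}\|v\|^{2}+Ce^{T}\xi
\qquad\text{s.t.}\quad D\left(K(A,\overline{A'})\,v+eb\right)+\xi\ge e,\quad \xi\ge 0,
$$

so the kernel matrix is only $m\times\bar m$ with $\bar m\ll m$, which is where the savings in computation and storage come from.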
PSVM-2V
To begin with, we can directly reformulate PSVM-2V (1) for incomplete-view learning, as shown below.
$$
\text{s.t.}\quad \left|\,[[\langle w_{A_1},\Phi_{A_1}\rangle]]_l^r-[[\langle w_{A_2},\Phi_{A_2}\rangle]]_l^r\,\right|\le \eta+e\epsilon,
$$
$$
[[\xi^{A_2}]]_l^r \ge [[D_1\langle w_{A_1},\Phi_{A_1}\rangle]]_l^r,\qquad
\xi^{A_1},\xi^{A_2},\eta \ge 0,
$$
Then problem (3) can be rewritten as follows.
$$
\text{s.t.}\quad \left|\,[[K_{A_1}(A_1,A_1')v_{A_1}]]_l^r-[[K_{A_2}(A_2,\overline{A_2'})v_{A_2}]]_l^r\,\right| \le \eta + e\epsilon,
$$
$$
[[\xi^{A_1}]]_l^r \ge [[D_2 K_{A_2}(A_2,A_2')v_{A_2}]]_l^r,
$$
$$
[[\xi^{A_2}]]_l^r \ge [[D_1 K_{A_1}(A_1,A_1')v_{A_1}]]_l^r.
$$
RPSVM-2V(*****)
Formally, RPSVM-2V can be built as follows.
$$
[[\xi^{A_1}]]_l^r \ge [[D_2 \widetilde{K_{A_2}}(A_2,\overline{A_2'})\widetilde{v_{A_2}}]]_l^r,
$$
$$
[[\xi^{A_2}]]_l^r \ge [[D_1 \widetilde{K_{A_1}}(A_1,\overline{A_1'})\widetilde{v_{A_1}}]]_l^r,\qquad
\xi^{A_1},\xi^{A_2},\eta \ge 0.
$$
- generalization error bound
- Generalized performance
- parameter study
In this section, the proof and the algorithm are stated in detail before the final experimental results.
In this paper, we propose a new model, RPSVM-2V, by integrating RSVM with PSVM-2V in both incomplete- and complete-view scenarios. RPSVM-2V can not only fully leverage all available information in the incomplete-view case, but also achieve comparable performance with less computation and storage cost in the complete-view setting.