Data Mining

Rule Mining

Goal: Identify items that are bought together by sufficiently many customers. We will do so by finding dependencies among items from data collected. Our input is composed of a table of items and a table of baskets. Each basket is a list of items purchased together. Then, we use the baskets to find dependencies of the form $\{x, y, z\} \rightarrow \{v, w\}$. Definition: Support The support of a set of items $I$ is the number of baskets that contain all the items in $I$....

Alternative classification techniques

K-Nearest Neighbors classifier The idea is to represent each record in the data set as an element in $\mathbb{R}^n$ $\DeclareMathOperator*{\argmax}{arg \,max \,} \DeclareMathOperator*{\argmin}{arg \,min \,}$. Then, to predict the class of a new point $x$, compute the $k$ points that are nearest to $x$. The majority class of these $k$ points is the predicted class of $x$. To run this algorithm, we need to define a distance function and also a value for $k$....

k-cores and Densest Subgraph

Definition: Induced subgraph A graph $H = (V_H, E_H)$ is an induced subgraph of $G = (V_G, E_G)$ if $V_H \subseteq V_G$ and if $u, v \in V_H$ and $(u, v) \in E_G$, then $(u, v) \in E_H$. We will say that $\delta_G(v)$ is the number of edges incident to $v$ in $G$. Definition: $k$-core Given a graph $G$ and $k \geq 0$, a subgraph $H$ of $G$ is a $k$-core if:...

Decision Trees & Random Forests

Classification models We are interested in classifying a data set among many classes. Each point in the data set has many attributes. Those can be either discrete or continuous, but the classes can only be discrete. If a continuous class is required, then we should use a regression model. It is also worth noting that the classes have no order relation (we cannot say that class 5 is greater than class 2)....

Clustering

Our goal in clustering is to group similar data points together. Each group will be called a cluster. Ideally, the intra-cluster distances are minimized and the inter-cluster distances are maximized. Note that this is an unsupervised model, so the following cannot be considered as clustering: Supervised classification; Simple segmentation; Results of a query; Graph partitioning. There are two types of clustering: Partitional clustering: divide data into non-overlapping subsets & each data is in exactly one subset; Hierarchical clustering: A set of nested clusters organized as a hierarchical tree....

Page Rank & Community detection

Page Rank Let $G = (V, E)$ be a web graph. Each node $v \in V$ represents a web page, while there is a directed edge from $u \in V$ to $v \in V$ if $u$ has a hyperlink to $v$. The importance of $v$ is proportional to the importance of nodes linking to $v$. Let $\delta_\text{in}(v)$ and $\delta_\text{out}(v)$ be the in and out degrees of $v$ in $G$. Then we can define the matrix $M$ as follows:...