Data Mining

Rule Mining

Goal: Identify items that are bought together by sufficiently many customers. We do so by finding dependencies among items in the collected data. The input consists of a table of items and a table of baskets, where each basket is a list of items purchased together. We then use the baskets to find dependencies of the form $\{x, y, z\} \rightarrow \{v, w\}$. Definition: Support. The support of a set of items $I$ is the number of baskets that contain all the items in $I$. Given a support threshold, we can classify itemsets as frequent or not. ...
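As a minimal sketch of the support definition above, the following Python counts the baskets that contain every item of an itemset. The names `support`, `is_frequent`, and the sample baskets are illustrative assumptions, and so is the convention that an itemset is frequent when its support is at least the threshold (the excerpt does not say whether the comparison is strict).

```python
def support(itemset, baskets):
    """Return the number of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= set(basket))


def is_frequent(itemset, baskets, threshold):
    """Assumed convention: frequent means support >= threshold."""
    return support(itemset, baskets) >= threshold


# Hypothetical sample data: each basket is a list of items bought together.
baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs"],
]

print(support({"milk", "bread"}, baskets))         # 2
print(is_frequent({"milk", "bread"}, baskets, 2))  # True
print(is_frequent({"milk", "bread", "eggs"}, baskets, 2))  # False
```

With a threshold of 2, every pair of items in this sample is frequent, but the full triple appears in only one basket and is not.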

Alternative classification techniques

K-Nearest Neighbors classifier. The idea is to represent each record in the data set as an element in $\mathbb{R}^n$. Then, to predict the class of a new point $x$, compute the $k$ points that are nearest to $x$. The majority class of these $k$ points is the predicted class of $x$. To run this algorithm, we need to define a distance function and also a value for $k$. ...
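The prediction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance function is taken to be Euclidean distance (the text leaves the choice open), ties in the vote go to the label encountered first among the nearest points, and the function name `knn_predict` and the sample data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, x, k):
    """Predict the class of `x` by majority vote among its k nearest points."""
    # Sort the labelled points by Euclidean distance to x (assumed metric).
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], x),
    )
    # Count the labels of the k nearest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data in R^2 with two classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5), k=3))  # a
print(knn_predict(points, labels, (5.5, 5.5), k=3))  # b
```

Choosing an odd $k$ in a two-class problem avoids tied votes.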

k-cores and Densest Subgraph

Definition: Induced subgraph. A graph $H = (V_H, E_H)$ is an induced subgraph of $G = (V_G, E_G)$ if $V_H \subseteq V_G$ and $E_H$ consists of exactly the edges of $E_G$ with both endpoints in $V_H$; that is, whenever $u, v \in V_H$ and $(u, v) \in E_G$, then $(u, v) \in E_H$. We write $\delta_G(v)$ for the number of edges incident to $v$ in $G$. Definition: $k$-core. Given a graph $G$ and $k \geq 0$, a subgraph $H$ of $G$ is a $k$-core if: ...
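To make the induced subgraph definition and the notation $\delta_G(v)$ concrete, here is a small Python sketch. It assumes a simple undirected graph stored as a set of vertex pairs; the helper names `induced_subgraph` and `degree` and the sample graph are illustrative, not taken from the notes.

```python
def induced_subgraph(vertices, edges, subset):
    """Return (V_H, E_H) for the subgraph of (vertices, edges) induced by `subset`."""
    keep = set(subset)
    if not keep <= set(vertices):
        raise ValueError("subset must contain only vertices of the graph")
    # Keep exactly the edges whose two endpoints both lie in the subset.
    sub_edges = {(u, v) for (u, v) in edges if u in keep and v in keep}
    return keep, sub_edges


def degree(v, edges):
    """Number of edges incident to v (assumes a simple graph, no self-loops)."""
    return sum(1 for (a, b) in edges if v in (a, b))


# Hypothetical graph: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
V = {1, 2, 3, 4}
E = {(1, 2), (2, 3), (3, 1), (3, 4)}

V_H, E_H = induced_subgraph(V, E, {1, 2, 3})
print(sorted(E_H))     # [(1, 2), (2, 3), (3, 1)]
print(degree(3, E))    # 3
print(degree(3, E_H))  # 2
```

The degree of vertex 3 drops from 3 in $G$ to 2 in $H$ because the edge to vertex 4 is not part of the induced subgraph.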

Decision Trees & Random Forests

Classification models. We are interested in assigning each point of a data set to one of several classes. Each point in the data set has many attributes. Those can be either discrete or continuous, but the classes can only be discrete. If a continuous class is required, then we should use a regression model. It is also worth noting that the classes have no order relation (we cannot say that class 5 is greater than class 2). ...

Clustering

Our goal in clustering is to group similar data points together. Each group is called a cluster. Ideally, the intra-cluster distances are minimized and the inter-cluster distances are maximized. Note that this is an unsupervised model, so the following cannot be considered as clustering: Supervised classification; Simple segmentation; Results of a query; Graph partitioning. There are two types of clustering: Partitional clustering: divide the data into non-overlapping subsets so that each data point is in exactly one subset; Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

Types of clusters. Well-separated cluster: any point in the cluster is closer to every other point in the cluster than to any point not in the cluster; Center-based cluster: an object in the cluster is closer to its center than to the center of other clusters. The center is usually the centroid or medoid (most representative point); Contiguous cluster: a point in the cluster is closer to one or more other points in the cluster than to any point not in the cluster; Density-based cluster: a cluster is a dense region of points, separated from other regions of high density by low-density regions; Conceptual cluster: clusters that share some common property or represent a particular concept.

K-means clustering. Input: A set $S$ of points in Euclidean space and an integer $k > 0$. Output: A partitional clustering of $S$. ...
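The two kinds of center mentioned for center-based clusters can be contrasted with a short Python sketch. It assumes Euclidean distance and takes the medoid to be the cluster member with the smallest total distance to the other members, which is one common reading of "most representative point". The function names and sample cluster are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of the points; need not be a member of the cluster."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dim))


def medoid(points):
    """Cluster member minimizing the sum of distances to all members (assumed definition)."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Hypothetical cluster with one far-away point.
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (10.0, 10.0)]

print(centroid(cluster))  # (3.25, 2.75)
print(medoid(cluster))    # (1.0, 1.0)
```

The far-away point pulls the centroid away from the three nearby points, while the medoid stays on an actual member of the cluster.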