Big Data Analysis and Mining

Teaching contents

Chapter 1 Introduction (2 class hours)

Contents: This chapter begins with the different data formats produced by different sources and presents the four important features of big data. Based on these features, it then introduces the challenges and tasks of big data mining.

1) What is Big Data

2) The features of Big Data

3) The challenges of Big Data

4) Main Tasks of Big Data

Requirements: Students should know the background of big data and understand its main features. They should also be familiar with the main challenges and mining tasks of big data.

Chapter 2 Foundations of Data Mining (12 class hours)

Contents: In this chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. It covers the basic data mining techniques, including preprocessing, clustering, classification, association rule mining and outlier detection.

1) Data Preprocessing

2) Clustering

3) Classification

4) Association Rule Mining

5) Outlier Detection

Requirements: Students should first have an idea of the basic process and the common categories of data mining. In addition, students should grasp the basic data mining algorithms, including k-means, DBSCAN, KNN, Naïve Bayes, Decision Tree, Apriori and LOF. Beyond these, the widely used PCA is also a key point to keep in mind.
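
As an illustration of the clustering item above, the following is a minimal sketch of k-means (Lloyd's iteration) in plain Python. The function names, the iteration cap and the seed are illustrative assumptions and are not taken from the course materials.

```python
import random


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
    )


def kmeans(points, k, iters=20, seed=0):
    """Cluster a list of equal-length tuples into k groups.

    Returns (centers, labels), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initialise from k random points
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of the members.
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep the old center if a cluster became empty.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, [nearest(p, centers) for p in points]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

The result of k-means depends on the initial centers, so a real analysis would typically run it several times or use a more careful initialisation.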

Chapter 3 Finding Similar Items (4 class hours)

Contents: This chapter introduces hashing methods that can be applied to similarity-related tasks. The main contents include the following items.

1) What is hashing and why use hashing

2) Shingling and MinHash

3) Locality-Sensitive Hashing

4) Learning to Hash

Requirements: Students should grasp the basic ideas of MinHash and LSH and their applications. In addition, students should understand how learning to hash works.
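
To make the shingling and MinHash items concrete, here is a minimal sketch that estimates the Jaccard similarity of two strings from their MinHash signatures. The shingle length, the number of hash functions, the use of `zlib.crc32` to map shingles to integers and the linear hash family are all assumptions chosen for illustration.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime used as the modulus of the hash family


def shingles(text, k=3):
    """Return the set of character k-shingles of text (text must have at least k characters)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(items, num_hashes=128, seed=42):
    """Return the MinHash signature of a non-empty set of strings.

    Each hash function has the form h(x) = (a * x + b) mod PRIME. Using the same
    seed for every set guarantees that all signatures share the same functions.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]
    ids = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = shingles("finding similar items with hashing")
    b = shingles("finding similar documents with hashing")
    exact = len(a & b) / len(a | b)
    approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
    print(f"exact={exact:.3f} estimate={approx:.3f}")
```

Locality-sensitive hashing would then split each signature into bands and hash the bands into buckets, so that only pairs sharing a bucket are compared.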

Chapter 4 Sampling (4 class hours)

Contents: This chapter mainly introduces the definition of sampling and the most common sampling methods on static and dynamic data, such as inverse transform sampling, rejection sampling, importance sampling, Markov chain Monte Carlo sampling and reservoir sampling.

1) Basics of sampling

2) Inverse transform sampling

3) Rejection sampling

4) Importance sampling

5) Markov chain Monte Carlo (MH & Gibbs)

6) Reservoir sampling

Requirements: Students should know the conditions and goals of sampling and grasp the idea and process of the popular sampling methods.
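
Two of the methods listed above are short enough to sketch directly: inverse transform sampling for static distributions and reservoir sampling for streams of unknown length. The choice of the exponential distribution as the example target and the function names are assumptions made for illustration.

```python
import math
import random


def sample_exponential(rate, rng):
    """Inverse transform sampling for Exp(rate).

    The CDF is F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate.
    rng.random() lies in [0, 1), which keeps 1 - u strictly positive.
    """
    u = rng.random()
    return -math.log(1.0 - u) / rate


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from an iterable of unknown length.

    The i-th item (0-based) replaces a random reservoir slot with probability k / (i + 1).
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_exponential(2.0, rng) for _ in range(10000)]
    print("sample mean:", sum(draws) / len(draws))  # should be near 1 / rate
    print("reservoir:", reservoir_sample(range(1000), 5, rng))
```

Reservoir sampling needs only one pass and memory for k items, which is why it suits dynamic data.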

Chapter 5 Mining Data Streams (6 class hours)

Contents: This chapter introduces a special data format called a data stream. Based on the definition of a data stream, we show the features of data streams and present the challenges in data stream mining, especially the concept drift problem. Further, we discuss classification and clustering on data streams with concept drift.

1) Basics of data stream

2) Concept drift detection

3) Data stream classification

4) Data stream clustering

Requirements: Students should understand how data streams arise, how to tell whether a stream exhibits concept drift, and how to classify and cluster data streams both with and without concept drift.
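
The idea behind concept drift detection can be sketched with a deliberately simplified detector that compares a classifier's error rate in a recent window against an older reference window. This is an illustrative toy, not one of the published detectors, and the class name, window size and threshold are assumptions.

```python
from collections import deque


class WindowDriftDetector:
    """Flag drift when the recent error rate exceeds the reference rate by a threshold."""

    def __init__(self, window=50, threshold=0.2):
        self.window = window
        self.threshold = threshold
        self.reference = deque(maxlen=window)  # errors from the older, stable period
        self.recent = deque(maxlen=window)     # sliding window of the latest errors

    def update(self, error):
        """Feed one observation (1 = misclassified, 0 = correct); return True on drift."""
        if len(self.reference) < self.window:
            self.reference.append(error)
            return False
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False
        ref_rate = sum(self.reference) / self.window
        rec_rate = sum(self.recent) / self.window
        if rec_rate - ref_rate > self.threshold:
            # Treat the recent window as the new reference and start over.
            self.reference = deque(self.recent, maxlen=self.window)
            self.recent.clear()
            return True
        return False


if __name__ == "__main__":
    # A model that is always right for 100 steps, then always wrong.
    stream = [0] * 100 + [1] * 100
    detector = WindowDriftDetector()
    flagged = [i for i, e in enumerate(stream) if detector.update(e)]
    print("drift flagged at positions:", flagged)
```

In a stream classification setting, a flagged drift would typically trigger retraining or replacing the current model.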

Chapter 6 Graph Mining (6 class hours)

Contents: This chapter covers data in a special format, the graph, which is also called a network. After the basic definitions of networks, two important mining tasks and their related algorithms are introduced.

1) Basics of network

2) Key node identification

3) Community Detection

Requirements: Students should understand what a network is and how a network is represented. In addition, students should know common network mining techniques such as community detection and key node identification.
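
As one possible illustration of how a network is represented and how key nodes can be scored, the sketch below stores a graph as an adjacency list and ranks nodes with PageRank computed by power iteration. PageRank is used here only as an example of a key node measure; the syllabus does not say which algorithms the course covers, and the damping factor and iteration count are assumptions.

```python
def pagerank(adj, damping=0.85, iters=50):
    """Compute PageRank scores by power iteration.

    adj maps every node to the list of nodes it links to; every node that
    appears as a neighbour must also appear as a key.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Every node receives the teleportation share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = adj[v] if adj[v] else nodes  # dangling nodes spread evenly
            share = damping * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # An undirected star: each edge is stored in both directions.
    star = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
    scores = pagerank(star)
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

An adjacency list uses memory proportional to the number of edges, which matters for large sparse networks.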

Chapter 7 Hadoop and Spark (6 class hours)

Contents: This chapter mainly introduces two distributed processing software frameworks, Hadoop and Spark. First, we show the architecture of Hadoop, its components, and how it works. Then, starting from the limitations of MapReduce, we introduce the Spark framework and compare it to MapReduce.

1) Architecture of Hadoop

2) MapReduce and its key idea

3) Spark and its key idea

4) MapReduce vs. Spark

Requirements: Students should know the basic concepts of Hadoop and Spark and grasp the key idea of divide-and-conquer as applied in MapReduce. Finally, students should know the differences between MapReduce and Spark.
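
The divide-and-conquer idea behind MapReduce can be sketched as a single-process word count with explicit map, shuffle and reduce steps. This is a plain Python simulation for illustration only; it does not use the Hadoop or Spark APIs, and all function names are assumptions.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big graph"]))
```

In a real cluster, map tasks run in parallel on separate data blocks and the shuffle moves data across machines.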
