- This course (unit) was originally designed for elite-class Bachelor and Master students at several top Asia-Pacific universities, including Deakin University (SIT742) and its partner Southwest University (since 2015).
- Materials in this module include resources collected from various open-source online repositories.
- If you find any issue or bug in this document, please submit an issue at
- Prerequisite unit 👉 :
- Subsequent unit 👉 :
- Pull requests are welcome:
- Point of Contact 👉 : Prof. Gang Li
Prepared by 🌷 TULIP Lab
Designed primarily for aspiring data scientists, this course (aka unit) lays the foundational groundwork for modern methods, techniques, and applications in data science. Upon successful completion, students will be able to utilize distributed storage and computing platforms to process and analyze big data, employing contemporary techniques in data analytics.
The learning activities in this course are structured to help students develop knowledge and skills in reviewing tabular data, such as relational databases and distributed storage systems, with a focus on platforms like Apache Spark. In the realm of data analytics, students will explore various data mining and machine learning methods. Additionally, students will have the opportunity to delve into advanced concepts such as differential privacy and frequent pattern discovery using association rule mining algorithms.
This course offers a blend of theory and practical application, aimed at providing a comprehensive mathematical toolkit essential for future data scientists.
Students will have access to a full range of subject materials, comprising slide handouts and relevant readings. Students are advised to begin each session by reviewing the relevant slide handouts and readings to gain a solid understanding of the content.
Additionally, students are encouraged to supplement their knowledge by conducting independent research, utilizing online resources or referring to textbooks that cover relevant information related to the topics under study.
This unit requires a total of 44 class hours: 22 hours of lectures and 22 hours of workshops, interactive discussions, or student presentations.
The unit's lecture plan is organized around the 6 modules below:
㊙️
All lecture slide handouts are password protected and available to Deakin SIT742 students on the CloudDeakin site.
🔬 Module | 🏷️ Category | 📒 Topic | 🎯 ULOs |
---|---|---|---|
0️⃣ | Preliminary | 📖 Unit Induction | ULO1 |
1️⃣ | Preliminary | 📖 Python Foundations for Big Data | ULO1, ULO2 |
2️⃣ | Core | 📖 Big Data | ULO2, ULO3 |
3️⃣ | Core | 📖 Big Data Manipulation | ULO4, ULO5 |
4️⃣ | Core | 📖 Big Data Analytics | ULO4, ULO5 |
5️⃣ | Advanced | 📖 Advanced Topics in Big Data | ULO1, ULO3 |
The repository of this unit's workshop (practical classes) can be found at:
We recommend completing the practicals associated with every module. You may install your own Python packages and Apache Spark, but it is much easier to run the materials on a cloud platform, such as:
- Google Colab: which will be used in SIT742 practical classes.
- Databricks Community Edition: offered by Databricks, the company founded by the original creators of Apache Spark.
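Whichever platform you choose, it can help to confirm that the packages used in the practicals are importable before a session starts. The following is a minimal sketch using only the Python standard library; the function name `check_environment` and the default package list are illustrative assumptions, not part of the unit materials.

```python
import importlib.util
import sys


def check_environment(packages=("pyspark", "numpy", "pandas")):
    """Return a dict mapping each package name to True if it can be imported.

    The default package list is an assumption for illustration only;
    replace it with whatever a given practical actually imports.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, found in check_environment().items():
        print(f"{name}: {'available' if found else 'missing'}")
```

If a package is reported as missing, it can usually be installed from within a notebook cell or a terminal with `pip install <package>`.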
Students come into this unit with varied technical backgrounds, so you may schedule your own study based on your available time and background. We assume no prior knowledge of Python programming, though some understanding of computer programming is helpful. The following is our recommended practical schedule.
🔬 Session | 🏷️ Category | 📒 Topic |
---|---|---|
1️⃣ | 📖 Unit Induction | M02C, M02D |
2️⃣ | 📖 Python Foundations for Data Science | M02E, M02F |
3️⃣ | 📖 Python Foundations for Data Science | M02G, M02H |
4️⃣ | 📖 Big Data | M03D, M03E |
5️⃣ | 📖 Big Data | M03X, M04A |
6️⃣ | 📖 Data Manipulation | M04B, M04F |
7️⃣ | 📖 Data Manipulation | M04G, M04H |
8️⃣ | 📖 Data Analytics | M05A, M05B |
9️⃣ | 📖 Data Analytics | M05C, M05E |
🔟 | 📖 Advanced Topics in Data Science | M06A, M06B |
🏆 | 📖 Advanced Topics in Data Science | M06D, M06E |
Each cohort of students may be assessed differently, depending on the specific requirements of your unit chair (the responsible professor at your university).
The assessment mainly aims to measure students' achievement of the Unit Learning Outcomes (ULOs, a.k.a. objectives) and to check their mastery of the theory and methods covered in the unit.
The detailed assessment specification and marking rubrics can be found at: M00D-Assessment. The relationship between each assessment task and the ULOs is shown as follows:
🔬 Task | 👨🏫 Category | 🎯 ULO1 | 🎯 ULO2 | 🎯 ULO3 | Percentage |
---|---|---|---|---|---|
1️⃣ | Presentation | 50% | 25% | 25% | 30% |
2️⃣ | Project | 25% | 50% | 25% | 50% |
3️⃣ | Other | 33% | 33% | 34% | 20% |
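
The table above can be combined into an overall weight per ULO. The sketch below assumes that each ULO column gives the share of a task devoted to that outcome and that the Percentage column gives the task's share of the final mark; this reading is an assumption, so check M00D-Assessment for the authoritative breakdown. The names `tasks` and `overall_ulo_weights` are illustrative.

```python
# Task weights and per-task ULO shares, copied from the table above.
# Interpretation (assumed): "weight" is the task's percentage of the final
# mark, and "ulo" is how that task is split across the learning outcomes.
tasks = {
    "Presentation": {"weight": 30, "ulo": {"ULO1": 50, "ULO2": 25, "ULO3": 25}},
    "Project": {"weight": 50, "ulo": {"ULO1": 25, "ULO2": 50, "ULO3": 25}},
    "Other": {"weight": 20, "ulo": {"ULO1": 33, "ULO2": 33, "ULO3": 34}},
}


def overall_ulo_weights(tasks):
    """Sum weight * share / 100 over all tasks for each ULO."""
    totals = {}
    for task in tasks.values():
        for ulo, share in task["ulo"].items():
            totals[ulo] = totals.get(ulo, 0.0) + task["weight"] * share / 100
    return totals


weights = overall_ulo_weights(tasks)
for ulo, pct in weights.items():
    print(f"{ulo}: {pct:.1f}% of the final mark")
```

Under this reading, ULO1, ULO2, and ULO3 account for roughly 34.1%, 39.1%, and 26.8% of the final mark respectively.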
- 2024 - The due date for final assessment file submissions is 🗓️ Saturday, 27/07/2024 (tentative); group of one member only (individual work) for all tasks.
You are expected to submit each assessment component on time. Do not leave everything until the last moment, because we will provide feedback that you are expected to use in later assessments.
㊙️
If you find that you are having trouble meeting your deadlines, contact the Unit Chair.
This course recommends several key references:
- Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
- Doing Data Science: Straight Talk from the Frontline, by Cathy O'Neil, Rachel Schutt
- Learning Spark: Lightning-Fast Data Analytics, by Jules S. Damji, et al.
Thanks go to these wonderful people 🌷
Made with contributors-img.