tulip-lab/modern-data-science

Modern Data Science

  • This course (unit) was originally designed for elite-class Bachelor's and Master's students at several top Asia-Pacific universities, including Deakin University (SIT742) and its partner Southwest University, and has been offered since 2015.
  • Materials in this module include resources collected from various open-source online repositories.
  • If you find an issue or bug in this document, please submit an issue at GitHub issues
  • Prerequisite unit 👉 : GitHub watchers
  • Subsequent unit 👉 : GitHub watchers
  • Pull requests are welcome: GitHub pull requests
  • Point of Contact 👉 : Prof. Gang Li

Prepared by 🌷 TULIP Lab


💡 Content

Designed primarily for aspiring data scientists, this course (aka unit) lays the foundational groundwork for modern methods, techniques, and applications in data science. Upon successful completion, students will be able to utilize distributed storage and computing platforms to process and analyze big data, employing contemporary techniques in data analytics.

The learning activities in this course are structured to help students develop knowledge and skills in working with tabular data held in relational databases and distributed storage systems, with a focus on platforms like Apache Spark. In data analytics, students will explore a range of data mining and machine learning methods. Students will also have the opportunity to study advanced concepts such as differential privacy and frequent pattern discovery using association rule mining algorithms.

This course offers a blend of theory and practical application, aimed at providing a comprehensive mathematical toolkit essential for future data scientists.

📒 Modules

Students will have access to a comprehensive range of subject materials, comprising slide handouts and relevant readings. Students should begin each session by thoroughly reviewing the relevant slide handouts and readings to gain a solid understanding of the content.

Additionally, students are encouraged to supplement their knowledge by conducting independent research, utilizing online resources or referring to textbooks that cover relevant information related to the topics under study.

This unit requires a total of 44 class hours: 22 hours of lectures and 22 hours of workshops, interactive discussions, or student presentations.

🗓️ Lectures Plan

The unit's lecture plan is organized around six modules, as shown below:

㊙️

All lecture slide handouts are password-protected and available to Deakin SIT742 students on the CloudDeakin site.

| 🔬 Module | 🏷️ Category | 📒 Topic | 🎯 ULOs |
|---|---|---|---|
| 0️⃣ | Preliminary | 📖 Unit Induction | ULO1 |
| 1️⃣ | Preliminary | 📖 Python Foundations for Big Data | ULO1, ULO2 |
| 2️⃣ | Core | 📖 Big Data | ULO2, ULO3 |
| 3️⃣ | Core | 📖 Big Data Manipulation | ULO4, ULO5 |
| 4️⃣ | Core | 📖 Big Data Analytics | ULO4, ULO5 |
| 5️⃣ | Advanced | 📖 Advanced Topics in Big Data | ULO1, ULO3 |

🗓️ Workshop Plan

The repository of this unit's workshop (practical classes) can be found at: GitHub watchers

We recommend completing the practicals associated with every module. You may install Python and Apache Spark on your own machine, but it is much easier to run the materials on a cloud platform, such as:

Students come into this unit with varied technical backgrounds, so you may schedule your own study based on your available time and background. We assume no prior knowledge of Python programming, though some understanding of computer programming is helpful. The following is our recommended practical schedule.

| 🔬 Session | 🏷️ Category | 📒 Topic |
|---|---|---|
| 1️⃣ | 📖 Unit Induction | M02C, M02D |
| 2️⃣ | 📖 Python Foundations for Data Science | M02E, M02F |
| 3️⃣ | 📖 Python Foundations for Data Science | M02G, M02H |
| 4️⃣ | 📖 Big Data | M03D, M03E |
| 5️⃣ | 📖 Big Data | M03X, M04A |
| 6️⃣ | 📖 Data Manipulation | M04B, M04F |
| 7️⃣ | 📖 Data Manipulation | M04G, M04H |
| 8️⃣ | 📖 Data Analytics | M05A, M05B |
| 9️⃣ | 📖 Data Analytics | M05C, M05E |
| 🔟 | 📖 Advanced Topics in Data Science | M06A, M06B |
| 🏆 | 📖 Advanced Topics in Data Science | M06D, M06E |

🈵 Assessment

Each student cohort may be assessed differently, depending on the specific requirements of your unit chair (the professors at your university).

The assessment mainly measures students' achievement of the Unit Learning Outcomes (ULOs, a.k.a. objectives) and checks their mastery of the theory and methods covered in the unit.

📖 Assessment Plan

The detailed assessment specification and marking rubrics can be found at: M00D-Assessment. The relationship between each assessment task and the ULOs is shown below:

| 🔬 Task | 👨‍🏫 Category | 🎯 ULO1 | 🎯 ULO2 | 🎯 ULO3 | Percentage |
|---|---|---|---|---|---|
| 1️⃣ | Presentation | 50% | 25% | 25% | 30% |
| 2️⃣ | Project | 25% | 50% | 25% | 50% |
| 3️⃣ | Other | 33% | 33% | 34% | 20% |
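
To see how much each ULO contributes to the final grade, the task weights can be combined with the per-task ULO splits. The sketch below assumes that the "Percentage" column is each task's share of the final grade and that the ULO columns split each task's marks across the three ULOs; the helper name `overall_ulo_weights` is hypothetical and not part of the unit materials.

```python
# Values copied from the assessment table above.
# Assumption: "Percentage" is the task's share of the final grade,
# and the ULO columns split that task's marks across ULO1-ULO3.
TASK_WEIGHTS = {"Presentation": 0.30, "Project": 0.50, "Other": 0.20}
ULO_SPLITS = {
    "Presentation": {"ULO1": 0.50, "ULO2": 0.25, "ULO3": 0.25},
    "Project": {"ULO1": 0.25, "ULO2": 0.50, "ULO3": 0.25},
    "Other": {"ULO1": 0.33, "ULO2": 0.33, "ULO3": 0.34},
}


def overall_ulo_weights(task_weights, ulo_splits):
    """Return each ULO's share of the final grade (hypothetical helper)."""
    totals = {}
    for task, weight in task_weights.items():
        for ulo, share in ulo_splits[task].items():
            # A task's contribution to a ULO is its weight times that ULO's share.
            totals[ulo] = totals.get(ulo, 0.0) + weight * share
    return totals


weights = overall_ulo_weights(TASK_WEIGHTS, ULO_SPLITS)
for ulo, value in sorted(weights.items()):
    print(f"{ulo}: {value:.1%}")
```

Under these assumptions, ULO1, ULO2, and ULO3 account for roughly 34.1%, 39.1%, and 26.8% of the final grade respectively.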

🗓️ Submission Due Dates

  • 2024 - The due date for final assessment file submissions is 🗓️ Saturday, 27/07/2024 (tentative). All tasks are individual work (groups of one member only).

You are expected to submit each assessment component on time. Do not leave everything to the last moment, because we will provide feedback that you are expected to use in later assessments.

㊙️

If you find that you are having trouble meeting your deadlines, contact the Unit Chair.

📚 References

This course recommends several key references:

👉 Contributors

Thanks goes to these wonderful people 🌷