Both industry and academia are confronting the challenge of big data, i.e., data processing that involves data so voluminous or arriving at such high velocity that no single commodity machine is capable of storing or processing them all. A common approach to handling big data is to divide and distribute the processing job to a cluster of machines. Ideally, a course that teaches students how to work with big data would provide students access to a cluster for hands-on practice. However, a cluster of physical machines may be prohibitively expensive, particularly at smaller institutions with smaller budgets.
In this report, we summarize our experiences developing and using a virtual cluster in a big data mining and analytics course at a small private liberal arts college. A single moderately sized server hosts a cluster of virtual machines, which run the popular Apache Hadoop system. The virtual cluster gives students hands-on experience and costs less than an equal number of physical machines. It is also easily constructed and reconfigured. We describe our implementation, analyze its performance characteristics, and compare costs with physical clusters. We summarize our use of the virtual cluster in the classroom and show student feedback. For departments wishing to take a similar approach, we offer our software and curriculum under an open source license.
File attached: jeckroth-teaching big data.pdf (158 KiB)
Mon 26 Oct (displayed time zone: Eastern Time, US & Canada)
Session: 08:30 - 10:00

| Start | Length | Type | Title | Track | Presenter | Attachment |
|---|---|---|---|---|---|---|
| 08:30 | 15m | Day opening | SPLASH-E Introduction | SPLASH-E | Eli Tilevich, Virginia Tech | |
| 08:45 | 30m | Talk | Teaching Big Data with a Virtual Cluster | SPLASH-E | Joshua Eckroth, Stetson University | File attached |
| 09:15 | 30m | Talk | A Generic Framework for Engaging Online Data Sources in Introductory Programming Courses | SPLASH-E | Nadeem Hamid, Berry College | File attached |
| 09:45 | 15m | Break | Session 1 Discussion | SPLASH-E | | |