Mon 26 Oct 2015 08:45 - 09:15 at Ellwood 1 - Session 1 - Real-world Data Chair(s): Eli Tilevich

Both industry and academia are confronting the challenge of big data, i.e., data processing that involves data so voluminous, or arriving at such high velocity, that no single commodity machine can store or process it all. A common approach to handling big data is to divide the processing job and distribute it across a cluster of machines. Ideally, a course that teaches students how to work with big data would give them access to a cluster for hands-on practice. However, a cluster of physical machines may be prohibitively expensive, particularly at smaller institutions with smaller budgets.

In this report, we summarize our experiences developing and using a virtual cluster in a big data mining and analytics course at a small private liberal arts college. A single moderately-sized server hosts a cluster of virtual machines, which run the popular Apache Hadoop system. The virtual cluster gives students hands-on experience and costs less than an equal number of physical machines. It is also easily constructed and reconfigured. We describe our implementation, analyze its performance characteristics, and compare costs with physical clusters. We summarize our use of the virtual cluster in the classroom and show student feedback. For departments wishing to take a similar approach, we offer our software and curriculum under an open source license.

Mon 26 Oct

Displayed time zone: Eastern Time (US & Canada)

08:30 - 10:00
Session 1 - Real-world Data SPLASH-E at Ellwood 1
Chair(s): Eli Tilevich Virginia Tech
08:30
15m
Day opening
SPLASH-E Introduction
SPLASH-E
Eli Tilevich Virginia Tech
08:45
30m
Talk
Teaching Big Data with a Virtual Cluster
SPLASH-E
Joshua Eckroth Stetson University
09:15
30m
Talk
A Generic Framework for Engaging Online Data Sources in Introductory Programming Courses
SPLASH-E
Nadeem Hamid Berry College
09:45
15m
Break
Session 1 Discussion
SPLASH-E