Data science is a practice grounded in statistics but increasingly shaped by software and systems. Data scientists strive to take their work beyond simple research, but bridging the gap between the language of the data scientist and that of distributed systems is proving increasingly difficult. Factor in a fast-evolving ecosystem of tools and libraries, many delivered weekly, and you have a recipe for distraction.
Cloudera has introduced the Data Science Workbench, an enterprise data science platform that accelerates analytics projects from exploration to production. It is a collaborative, scalable, and highly extensible platform for data exploration, analysis, modeling, and visualization. It includes powerful features to bring data scientists, analysts, and business teams together.
Find out more using the links below:
Today, leading organizations struggle to make their data scientists productive with Hadoop clusters. Data scientists find it difficult to use their existing open source languages (e.g. Python, R) and libraries with Hadoop, especially when the clusters are secured with Kerberos. At the same time, IT doesn't want to give special access to these users, who require very diverse and specific environment configurations to run their experiments. As a result, most data science teams work away from the Hadoop cluster, often on their laptops or in other data silos. The negative business impacts are a lack of insight and agility for the most advanced users, and the security, governance, and cost issues that arise from data silos.
Cloudera Data Science Workbench is a new tool, under development, that will enable collaborative, customizable, self-service access by data scientists to secure Hadoop environments via Python, R, and Scala. It can be installed on any existing cluster, whether on-premises or in the cloud.
Matt Brandwein, Director of Product Management at Cloudera, and Tristan Zajonc, Senior Engineering Manager, discuss:
Machine learning and deep learning present an advanced opportunity for us to understand data beyond simple numbers and text. Data science practitioners want to quickly implement new machine learning and deep learning libraries but have few options for enterprise analytics systems that support these new tools. The Cloudera Data Science Workbench helps data scientists get ready access to Hadoop data, leverage the newest machine learning and deep learning frameworks, and deliver value much more quickly, all in a secure environment.
Join Sean Anderson, Senior Manager of Data Science Marketing at Cloudera, and Vartika Singh, Solutions Architect for Data Science at Cloudera, as they discuss:
"I've built a model -- now what?"
Developing a predictive model is only one part of a larger journey. Data scientists have to access and transform data, and engineer features, before exploratory modeling happens. A model doesn't do anything until it's applied to data, productionised and deployed.
Apache Hadoop can support all stages of the data science lifecycle, but how this is done is still more art than science, as it requires coordinating different teams and technologies. This webinar will demonstrate a simple reference architecture for connecting the output of exploratory data science in Cloudera Data Science Workbench with production deployment on Hadoop. This includes data engineering with Spark, modeling with Spark MLlib, and production build and deployment via git, Maven and Spark Streaming.
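As a rough illustration of the hand-off described above, the sketch below separates an exploratory step (engineer features, fit a model, save it) from a production step (load the saved model, score records as they arrive in small batches). It is not the webinar's reference architecture: plain Python stands in for Spark MLlib and Spark Streaming, and the record fields (`visits`, `spend`), the JSON model file, and every function name are hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile
from statistics import mean


def engineer_features(record):
    """Turn a raw record into one numeric feature (hypothetical schema)."""
    return float(record["visits"])


def fit_linear_model(rows):
    """Exploration step: least-squares fit of spend = slope * visits + intercept."""
    xs = [engineer_features(r) for r in rows]
    ys = [float(r["spend"]) for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


def save_model(model, path):
    """Persist the fitted parameters so a separate job can pick them up."""
    with open(path, "w") as f:
        json.dump(model, f)


def load_model(path):
    """Production step: read back the parameters written by save_model."""
    with open(path) as f:
        return json.load(f)


def score_batch(model, batch):
    """Apply the model to one batch, reusing the same feature code as training."""
    return [
        model["slope"] * engineer_features(r) + model["intercept"]
        for r in batch
    ]


def micro_batches(records, size):
    """Yield fixed-size slices, standing in for a stream's micro-batches."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


if __name__ == "__main__":
    # Exploration: fit on historical data and save the result.
    history = [{"visits": v, "spend": 2 * v + 1} for v in (1, 2, 3, 4)]
    model_path = os.path.join(tempfile.mkdtemp(), "model.json")
    save_model(fit_linear_model(history), model_path)

    # Production: load the saved model and score incoming records.
    deployed = load_model(model_path)
    incoming = [{"visits": v} for v in (5, 6, 7)]
    for batch in micro_batches(incoming, size=2):
        print(score_batch(deployed, batch))
```

The point of the split is that training and scoring share the feature code and communicate only through the saved model, which is what lets the two halves be owned, versioned, and deployed by different teams.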
Matt is Director of Product Management at Cloudera, driving the platform's experience for data science and data engineering users. Before that, he led Cloudera's product marketing team for three years, with roles spanning product, solution, and partner marketing. Prior to Cloudera, he built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in Computer Science and Mathematics from the University of Massachusetts Amherst.
Tristan Zajonc is a senior engineering manager at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development from Harvard and a BA in economics from Pomona College.
Sean is a tenured infrastructure scaling and cloud strategy consultant with a strong focus on strategic partnerships and innovative hybrid technology. He has been part of integral shifts in technology, including the rise of cloud computing, open source standardization, big data, and machine learning. Sean quickly became a go-to resource and speaker for data-specific workloads, focusing on technologies like machine learning, data science, Apache Hadoop, MongoDB, Redis, ElasticSearch, SQL, and data warehousing. At Rackspace Hosting, Sean helped bring to market and launch open-source cloud platforms around Hadoop, MongoDB, and Redis. Sean is currently senior marketing manager for data science and data engineering at Cloudera, the pioneers of Apache Hadoop. Sean focuses on modern data science practices involving popular open-source languages like Python, R, and Scala, and speaks often about the convergence of big data and machine learning/AI.
Sean is Director of Data Science at Cloudera, based in London. Before Cloudera, he founded Myrrix Ltd, a company commercializing large-scale real-time recommender systems on Apache Hadoop. He has been a primary committer and VP for Apache Mahout, and co-author of Mahout in Action. Previously, Sean was a senior engineer at Google. He holds an MBA from the London Business School and a BA in Computer Science from Harvard.