Improving feature engineering in the lab and production with Ivory
Feature engineering is a critical and time-consuming activity in the development and deployment of any modelling pipeline. The difficulty is exacerbated as data science teams seek to incorporate new data sources at scales far larger than they have previously handled. Furthermore, the transition to production environments is fraught with complexity as these pipelines are exposed to the dynamic and fragile world of ongoing data feeds, data corrections, and evolving data models.
In this talk we will introduce Ivory, a new open-source, Hadoop-based data store that seeks to address these challenges. Ivory is a scalable and extensible data store for storing facts and extracting features. It is optimised specifically for the feature engineering stages of modelling pipelines, simultaneously simplifying and adding rigour to them.
This session will walk through an example of how Ivory can be used in the typical data scientist’s workflow, and then how that extends to migrating pipelines into production. It will cover the basic concepts of Ivory, such as repositories, the dictionary, its fact-based data model, and virtual features. It will also demonstrate the benefits of Ivory being an immutable data store and the unique opportunities that immutability creates.
Ben Lever is a co-founder and the CTO of Ambiata, a startup focused on creating products that allow organisations to take a more scientific and automated approach to business. At Ambiata he has led the deployment of large-scale machine learning systems into enterprises in industries such as finance, telecommunications, retail, and insurance. Before Ambiata, Ben led an engineering and research team at NICTA and started the open-source Scoobi project.