Processing big data with Vaex: dataframes and no clusters

Accepted

Data is getting bigger and bigger, making it almost impossible to process on desktop machines without a full cluster infrastructure, which limits experimentation. One way around this is Vaex, a Python library with a syntax similar to Pandas that helps us work with large datasets on machines with limited resources, where the only limitation is the size of your hard drive.


Type: Standard talk, 25 minutes

Level: Medium

Speakers: Marco Carranza

Speakers Bio: Entrepreneur, technical lead and co-founder of TeamCore Solutions. For the last 7 years we have been building solutions for the retail industry, using state-of-the-art machine learning algorithms and processing huge amounts of data for our customers across multiple countries in Latam.

Time: 15:00 - 15:30 - 12/06/2019

Room: D

Labels: vaex, pandas, big data, apache arrow

Description

Nowadays data is getting bigger and bigger, making it almost impossible to process on desktop machines. To solve this problem, many new technologies (Hadoop, Spark, Presto, Dask, etc.) have emerged in recent years to process all that data on clusters of computers. The challenge is that you need to build your solutions on top of these technologies, which requires designing data processing pipelines and, in some cases, combining multiple technologies. However, sometimes we don't have enough time or resources to learn and set up a full infrastructure just to run a couple of experiments. Maybe you are a researcher with very limited resources, or a startup on a tight schedule to launch a product to market. One way to achieve this is with Vaex, a Python library with a syntax similar to Pandas that helps us work with large datasets on machines with limited resources, where the only limitation is the size of your hard drive. Vaex provides memory-mapping, so it never touches or copies the data into memory unless explicitly requested.