Web Data Extraction Summit

2021 Speakers

How do we measure extraction quality

Measuring extraction quality is important both for people who want to find the best-performing service for their needs and for developers of extraction services, who want to make sure they are making progress and choosing the best solutions. In this session, Konstantin will take a deep dive into how to perform quality evaluation and the common pitfalls specific to web data extraction. He will also share his experience from running two open evaluations, for articles and products, and describe the process used for internal quality assessment.

Konstantin Lopukhin

Head of Data Science at Zyte

Konstantin Lopukhin is the Head of the Data Science team at Zyte, working on improving the quality of Automatic Extraction, which allows the extraction of products, articles, job postings and other data types from any website. He collaborates with the Data Science team, developing state-of-the-art models and algorithms for web data extraction and improving datasets and methods of quality assessment. In the last two years, Konstantin and the team published two open evaluations of article and product extraction quality, comparing commercial services and open source solutions.

Need to scrape 20 websites in 3 hours?

Professional web scraping is a complex technical task. You need to inspect the website, write the crawling and extraction logic, figure out how to make the downloading work, and then wait for the next website change to break everything you’ve achieved.
At Zyte, we’ve been developing a new hybrid web data extraction approach, combining ML-based scraping and crawling with custom code. It simplifies development, extracts data from websites much faster, and reduces the time spent on maintenance later.

In this talk, Mikhail Korobov will explain how it works and show how you can use this approach in your own web scraping projects: which tools can be used, what the best practices are, and what the caveats are.

Mikhail Korobov

Head of Development at Zyte

Mikhail is a long-term Scrapy contributor and an open-source enthusiast. For the past 7 years, Mikhail has been developing smart web crawlers. He’s currently leading the development of the Zyte Automatic Extraction technology.

AMA: Your web scraping questions answered

Join Theresia Tanzil, Peng-Yu Chen, Kevin Lloyd Bernal, Evgeny Slaikovsky and Nikita Vostretsov for an AMA (Ask Me Anything) session where they will answer any technical questions you have about web data extraction. These panelists are experts in the fields of data science, reverse engineering, solution architecture, and anti-bans. So bring forth all your technical queries!
The star-studded panel will be hosted by Suzanne Hassett, COO and Head of Delivery at Zyte.

Hosted by Suzanne Hassett

COO and Head of Delivery at Zyte

Suzanne Hassett is Chief Operations Officer and Head of Delivery at Zyte. She runs a team of 100+ developers responsible for delivering data to customers.

Prior to Zyte, Suzanne held positions including Chief Operations Officer and Global Head of Delivery at Britebill (acquired by Amdocs in 2016) and Professional Services Director at Norkom (acquired by BAE Systems in 2011).
Suzanne has 25+ years of experience working in software businesses in Ireland and internationally and holds a B.Sc. from DCU and an MBA from UCD.

Workshop: Building a real estate market monitoring tool with Scrapy, TimescaleDB and Apache Superset

During this workshop, Attila Tóth is going to give you a step-by-step guide to building your own personal real estate market monitoring tool. He is going to show how Scrapy (+Smart Proxy Manager) and TimescaleDB can power a data exploration tool built with Apache Superset.

Attila Tóth

Developer Advocate at Timescale

Attila Tóth is a Software Developer passionate about making data useful. He likes to craft data pipelines and visualisations to gain insights from large amounts of data. Attila enjoys sharing his knowledge with fellow developers through tutorials, live coding streams and technical talks.

You can reach him at attilatoth.dev.

AMA: Your web scraping questions answered

Peng-Yu Chen

Principal Reverse Engineer at Zyte

Peng-Yu Chen is the Principal Reverse Engineer of the anti-ban team at Zyte. He works on analysing observed web crawl bans to understand the cause, and on carrying out improvements to the crawling mechanisms to bypass such bans. He also works on browser customisations specifically for the purposes of web crawling, so that better anti-ban performance may be achieved as a generic renderer.

Adaptive learning with PandioML

The session will cover using adaptive learning with PandioML to connect to data, feed it into a pipeline, train the model, perform inference, deploy to production on the Pandio.com platform, and perform drift detection on the production model. The data used and the model trained are subject to change, but will be something along the lines of extracting reviews from the web (Yelp), creating a risk model, then detecting when the data drifts to reset the model.

Joshua Odmark

CTO at Pandio

Joshua is the CTO at Pandio and an expert in front-end and back-end software engineering with PHP, JavaScript, Ruby, Python, HTML, CSS, Apache Pulsar, BookKeeper, Zookeeper, MySQL, PostgreSQL, MongoDB, Presto, and Redis. He is a full-stack engineer with tons of experience in dev ops, data science, distributed systems, machine learning, artificial intelligence, leadership, project management, and product management.

Joshua is currently focusing on building a data-driven future of Pandio.

AMA: Your web scraping questions answered

Theresia Tanzil

Business Operations Manager at Zyte

Theresia is the Business Operations Manager at Zyte. Her mission is to help customers transform their business goals into technical requirements. She applies and builds upon her years of experience in software engineering and customer-facing roles to data acquisition projects and works closely with various internal and external teams to ensure proposed solutions are legally compliant, technically feasible, and commercially sound.

From code to data: Live-coding a small blog scraper

Join Jônatas Paganini’s live-coding session and learn how to create a small scraper that recursively navigates all the web pages of a single domain. Later, he'll explore the data extracted in his first POC, which downloaded 200k pages from a list of engineering blogs. He'll present statistics while diving into SQL queries with Postgres, exploring native text search, time-series, and statistical functions.

Jônatas Davi Paganini

Developer Advocate at Timescale

Jônatas is a Data nerd, cyclist, and blogger. He is a pair programming evangelist and his highlight experiences involve performance and architecture. Jônatas spent a few years writing automated strategies for financial markets, processing millions of events per day.

Panel: Legal hot topics in web data extraction

There have been some key updates to the legal framework of web scraping in 2020-2021. In this panel, Legal Counsel at Zyte, Kate O’Brien, brings together legal experts in the field of data extraction to discuss the various aspects of web scraping compliance and updates to the legal landscape. Discussion topics include the legal implications of accepting terms of service on a target website, updates and insights from the US Supreme Court's Van Buren v. United States opinion, and how to be compliant with the General Data Protection Regulation (GDPR).

Tricia Higgins

CEO at Fort Privacy

Tricia is CEO at Fort Privacy which she co-founded with Marie Murphy in 2017.

Fort Privacy uses a structured, multi-disciplinary approach to enabling compliant data processing activities by its clients. Tricia has extensive experience working with clients on their data protection compliance programs.
Tricia obtained a law degree from University College Cork in 1999 and qualified as a Barrister at Gray's Inn, London in 2006 while working at IBM in London. Tricia is focussed on the governance, transparency, accountability and transfer management aspects of the GDPR, including drafting and providing advice and guidance on Data Processing Agreements, Data Protection Statements, DPIAs, LIAs, intercompany agreements and internal policies and procedures.

Panel: Legal hot topics in web data extraction

Victoria Vlahoyiannis

Legal Counsel at Zyte

Victoria is a Legal Counsel at Zyte and a CIPP/E certified privacy professional. Victoria works closely with the sales team and engineers to review the legal implications for data extraction projects such as accepting terms and conditions or data protection law compliance. Victoria also works with the applicable teams to evaluate compliance and risk assessment when implementing internal policies.

Adlede's story of contextual advertising: Content centric advertisement without tracking users' personal data

The presentation will focus on what Adlede does with the extracted data. During the presentation, Kabir Fahria will present Adlede's real-life experiences and challenges in contextual advertising in the form of a case study. He will discuss what context means for Adlede, how they define context from the gathered data, and how that fits into the advertising domain. He will also draw on examples of projects that the company completed in collaboration with other companies.

Kabir Fahria

Developer Advocate at Adlede

Kabir has been a part of Adlede for more than 2 years. He is part of the development team responsible for developing and maintaining Adlede processes such as categorizing contextual data and matching advertisements to media content, overall NLP processes and algorithms, integrations with demand-side and supply-side platforms in the advertising industry, and building a cloud-based infrastructure on AWS and Digital Ocean.

AMA: Your web scraping questions answered

Kevin Lloyd Bernal

Technical Team Lead at Zyte

Kevin is currently a Technical Team Lead at Zyte, leading teams on multiple projects and designing large-scale crawls and ingestion systems. He is obsessed with understanding what lies beyond the data.
Kevin is taking a Master's degree in CS, specializing in ML and mechanical keyboards that go clickety-clack.

Applying design-thinking to build no-nonsense, effective dashboards for data visualization

In this session, Abhijith HK will take a deep dive into high-scale web scraping. He will share his experiences applying design-thinking to build no-nonsense, effective dashboards for data visualization, using ReactJS, D3JS, Bootstrap, ReactCharts, and more. Plus, he will share scalable hacks to build spiders using Scrapy, Smart Proxy Manager (formerly Crawlera), Splash, AWS, Redis, DynamoDB, GitLab, Docker, Python, JS and others.

Abhijith HK

Founder and CEO at Codewave

Abhijith is a technology director helping businesses thrive in an age of design, tech, and agile hyperness.
He's helped 100+ enterprises architect & run 'speed at scale' technological solutions for 10 years now.
In 2013, Abhijith, along with his partner Vidhya, launched 'Codewave', a design-led digital transformation company that also designed its own culture. Together, they've built 300+ digital solutions and helped hundreds of businesses at varying levels of digital integration: aspiring, advancing, plateaued. Abhijith is a science nerd, loves playing guitar and PS3, and helps businesses cut through the noise & get to the signal.

AMA: Your web scraping questions answered

Nikita Vostretsov

Data Scientist at Zyte

Nikita is a member of the Data Science team at Zyte. His contributions focus on training machine learning models powering the Zyte Automatic Extraction API. He works closely with the support team and monitors model performance to ensure the quality of web data extraction.

Panel: Legal hot topics in web data extraction

Hosted by Kate O'Brien

Legal Counsel at Zyte

Kate is Legal Counsel at Zyte and has extensive experience in the wide range of legal issues affecting web data extraction. At Zyte, she focuses on ensuring compliance on all data extraction projects, creating best practices based on current law and case law affecting web data extraction and educating colleagues on best practices.

Panel: Legal hot topics in web data extraction

Nina Fletcher

General Counsel at YipitData

Nina Fletcher is the General Counsel at YipitData, an alternative data provider based in New York City. YipitData is the on-demand, 100+ person data team for hundreds of the largest hedge funds, mutual funds, pension funds, private equity funds, family offices, sovereign wealth funds, and venture capital funds in the world. YipitData identifies, screens, licenses, cleans, and analyzes alternative data to help investors answer their key questions.
Prior to joining YipitData, Nina served as Senior Counsel and Chief Governance Officer at Bridgewater Associates, LP, a global macro hedge fund based in Connecticut, and in the New York offices of Sullivan & Cromwell, LLP and Covington & Burling, LLP, focusing on mergers & acquisitions, securities law, intellectual property and private equity.
Nina graduated from Yale University with a B.A., Columbia University School of Law with a J.D. and the London School of Economics & Political Science with an LL.M.

Scraping financial data: A practitioner's experience

Linus Nilsson operates a hedge fund database in which the majority of the data comes from public disclosures, which he scrapes in various shapes and forms. He will give you an introduction to his project, where he runs a number of import routines and cleaning strategies to reduce the dimensionality of the data and to ensure high-quality inputs.

Linus Nilsson

Founder of NilssonHedge

Linus Nilsson founded NilssonHedge, a public hedge fund database, as an initiative to bring transparency to the hedge fund universe. The database combines an innovative way of aggregating public performance data with free access to hedge fund returns. Mr. Nilsson is an experienced allocator who has spent close to twenty years investing in Hedge Funds (both privately and for institutions) or operating systematic strategies, as a startup but also as the CIO of a smaller European CTA.

Review and evolution of anti-bots

Today, many sites use various security measures to complicate the collection of data. Evgeny will talk about what methods and technologies are used in such systems to distinguish bots from real users. He will review the evolution of anti-bot systems, what they paid attention to before and what awaits us in the future.

Evgeny Slaikovsky

Principal Reverse Engineer at Zyte

Evgeny is a principal reverse engineer at Zyte. For the past 2 years, he has been deeply involved in overcoming anti-ban issues. He enables the engineers at Zyte to scrape data from protected websites.

Everything you always wanted to know about headless browsers but were afraid to ask

In this session, Paweł Miech will take a deep dive into headless browsers. His talk will be a guide to when and how to use headless browsers, which ones are better in certain situations, and the typical problems faced while using them. Paweł will also compare different headless browser tools such as Splash, Puppeteer, Playwright, Selenium, and others.

Paweł Miech

Technical Team Lead in the Delivery Department at Zyte

Paweł is a Technical Team Lead in the Delivery Department at Zyte. He has several years of experience developing advanced crawling solutions using the Scrapy framework. He loves contributing to open source. Paweł is one of the authors of the ScrapyRT framework and has made contributions to Splash.

Scrape and graph your way to conference glory: Building the ultimate Call for Papers tool

For a Developer Advocate, a common task is to submit talks to conferences, as well as to encourage others in the community to do the same. One of the biggest challenges behind this objective is finding the right conferences. With many different Call for Papers (CfP) hosting platforms, it is a time-intensive process. Combining web scraping approaches with a graph database is a great way to solve this exact problem. In this session, Ljubica is going to show you how she did exactly that. She will provide an overview of graph databases, what they are and how they differ from other data stores. She will walk you through how she used a combination of select keywords, Beautiful Soup and Google search on popular Call for Papers websites to scrape relevant data, and demonstrate how she can query this data and find recommendations for which conferences to apply to next.

Ljubica Lazarevic

Developer Advocate at Neo4j

Ljubica Lazarevic is an experienced IT practitioner with a background in development, architecture, consulting and leadership. She's currently working as a Developer Advocate at Neo4j, a leading graph database platform that drives innovation and competitive advantage.

Alternative data demand & supply factors: Growth drivers to the overall marketplace

Niall Hurley, CEO of Eagle Alpha, will give us an overview of what Alternative Data is, what are the current trends in demand and supply, what are the challenges and how do people work with this data.

Niall Hurley

CEO at Eagle Alpha

Management team member and leader with a focus on corporate development and growth. I combine extensive experience of capital markets and corporates with a focus on data and teamwork to arrive at client focused solutions.

Taming the World Wide Web: Challenges faced when dealing with 100k+ websites

The web is a messy place filled with websites that range from the unorganized, cluttered and last modified in 1990 to the highly organized, efficient websites made by professional teams of programmers. We faced many challenges in dealing with this degree of variability and have learned a few tricks in collecting and processing this much information. This talk will walk you through the journey from concept to iteration to designing your next large-scale data collection project.

Eric Platow

Senior Architect at LexisNexis

Eric is a Senior Architect at LexisNexis. His goal is to build scalable, reusable components that come together to turn our data stewards into superheroes. He is leading a cross-functional team of data scientists, data engineers, software engineers and the data steward team through a transformation from fully manual data collection to automated data collection and processing.

When to Flip the Table: SQL to NoSQL to NewSQL

Do you feel overwhelmed by the plethora of database options these days? What should stay in an RDBMS and what can go into a NoSQL data lake? What data structure resides most efficiently within which database type? How do SQL, NoSQL, and NewSQL intersect? How does one work with multiple databases in parallel? Which is right for my use case? This talk, geared towards those curious about data structure and storage, relational databases and NoSQL, will answer these questions and more as we explore the pros and cons of the three major database types available today.

Rain Leander

Technical Evangelist at Cockroach Labs

Rain Leander is a systematic, slightly psychic, interdisciplinary community liaison with a Bachelor’s in dance and a Master’s in IT. An epic public speaker, they have disappeared within a box stuffed with swords, created life, and gone skydiving with the Queen. Seriously. Rain is an active technical contributor with OpenStack, RDO, TripleO, Fedora, and DjangoGirls. Come say hello. Bring cake.
