Extracting data from websites gets far less attention and research than tasks like detecting objects in images or recognizing named entities in text.
However, it's a fascinating and rewarding area to delve into, because a web page can be represented in so many ways: as a screenshot, as the text on the page, as the underlying HTML code, and more. This opens up a variety of creative methods for tackling the problem, often combining several types of data, and several representations of that data, within a single model. The recent emergence of large language models has introduced an entirely new approach to this task.
In this talk with Konstantin, you will:
• Learn how web data extraction can be framed as a machine learning problem, and what kinds of information a model can use.
• Discover how ChatGPT, a chatbot built on a large language model, can be used for this purpose, and understand its limitations.
• See how a sophisticated model can grow from simple beginnings, pick up handy techniques from related fields and studies, and review several modern methods for tackling this problem, such as transformer-based architectures.
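To make the first point concrete, here is a minimal sketch of one possible framing, and not necessarily the one presented in the talk: each text node of a page becomes a candidate that a classifier could label (for example as "title", "price", or "other"). The names `TextNodeCollector` and `extract_candidates`, the sample page, and the choice of features (parent tag, CSS class, depth) are all assumptions made for illustration. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipping them keeps the stack accurate.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeCollector(HTMLParser):
    """Collect each non-empty text node with simple features of its parent element."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open elements as (tag, class attribute)
        self.nodes = []   # one feature dict per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, css_class = self._stack[-1]
            self.nodes.append(
                {"text": text, "tag": tag, "class": css_class, "depth": len(self._stack)}
            )


def extract_candidates(html):
    """Return the candidate text nodes of a page, ready to be featurized further."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


# Hypothetical product page used only for this example.
PAGE = """
<html><body>
  <h1 class="title">Blue Kettle</h1>
  <span class="price">19.99</span>
  <p>Ships in two days.</p>
</body></html>
"""

candidates = extract_candidates(PAGE)
for node in candidates:
    print(node)
```

Each dictionary is one training or prediction example. A real system would likely add richer signals, such as text embeddings or visual features taken from a screenshot, which is where combining several representations in one model comes in.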