Fonduer: Knowledge Base Construction from Richly Formatted Data

Knowledge bases enable valuable downstream applications such as information retrieval, question answering, medical diagnosis, and data visualization. However, building high-quality knowledge bases is difficult. While extensive efforts have focused on unstructured text, troves of information remain untapped in richly formatted data, where relations are conveyed using textual, structural, tabular, and visual cues.

We recently built Fonduer, a knowledge base construction framework for richly formatted information extraction. Fonduer is the first knowledge base construction system for richly formatted data. It uses a new unified data model, which preserves structural and semantic information across different data modalities, and a human-in-the-loop paradigm called data programming to train machine learning systems.

Why is it called Fonduer?

We think our system for extracting information from richly formatted data shares some characteristics with rich and savory fondue. Specifically, the analogy maps onto three challenges of richly formatted data that Fonduer seeks to address.

Prevalent Document-level Relations

Fig. 1: With richly formatted data, we need to look at the whole picture.

The first challenge is the prevalence of document-level relations. In order to extract information from a PDF document, for example, we typically cannot look at the context of just a single sentence. If we limit the context to a single sentence or table, we can miss up to 97% of the relations in the document! Instead, we need to step back and consider the document as a whole in order to appreciate and capture all of the rich information it contains.
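
To make the scope problem concrete, here is a minimal sketch in Python. It is not Fonduer's API: the toy document, the vocabularies, and the function names are all made up for illustration. It pairs a part name with a value, first only within a single context and then across the whole document.

```python
from itertools import product

# Hypothetical toy document: (context_id, text) pairs standing in for
# a header line and a table row of a datasheet.
document = [
    ("header", "SMBT3904 NPN Silicon Switching Transistor"),
    ("table_row_1", "Collector current 200 mA"),
]

# Hypothetical vocabularies used to spot mentions by substring match.
PARTS = ["SMBT3904"]
VALUES = ["200 mA"]


def mentions(text, vocabulary):
    """Return the vocabulary terms that occur in the text."""
    return [term for term in vocabulary if term in text]


def sentence_level_candidates(doc):
    """Pair a part with a value only when both occur in the same context."""
    pairs = []
    for _, text in doc:
        pairs.extend(product(mentions(text, PARTS), mentions(text, VALUES)))
    return pairs


def document_level_candidates(doc):
    """Pair every part with every value found anywhere in the document."""
    parts = [m for _, text in doc for m in mentions(text, PARTS)]
    values = [m for _, text in doc for m in mentions(text, VALUES)]
    return list(product(parts, values))


print(sentence_level_candidates(document))  # no pair shares a context
print(document_level_candidates(document))  # the header/table pair appears
```

In this toy case the part name sits in the header and the value sits in a table row, so the single-context search finds nothing while the document-wide search finds the pair. The wider scope also produces many more false candidates on real documents, which is why a learned classifier is needed downstream.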

Multimodality

Fig. 2: We need to consider signals from multiple data modalities together, not in isolation.

The second challenge is multimodality. Just as fondue is made up of a variety of ingredients, each with their own flavor and textures that come together to make a meal, richly formatted documents rely on a variety of data modalities to convey information. For example, bold text, placement on a page, and visual alignment in a table column all convey meaning. Fonduer captures textual, structural, tabular, and visual information in a unified data model.
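
As a rough illustration of what "unified" can mean here, the sketch below keeps textual, structural, tabular, and visual attributes on the same record so that a single check can combine them. The `Word` class, its fields, and the helper functions are hypothetical and are not taken from Fonduer's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    """One word carrying attributes from each modality (hypothetical schema)."""
    text: str                  # textual: the token itself
    html_tag: str              # structural: enclosing tag, e.g. "td" or "h1"
    is_bold: bool              # visual: font styling
    page: int                  # visual: page number
    left: float                # visual: left edge of the bounding box
    row: Optional[int] = None  # tabular: row index, None outside a table
    col: Optional[int] = None  # tabular: column index, None outside a table


def same_column(a: Word, b: Word) -> bool:
    """Tabular signal: both words sit in the same column index.

    Simplification: assumes a single table per document.
    """
    return a.col is not None and a.col == b.col


def visually_aligned(a: Word, b: Word, tolerance: float = 2.0) -> bool:
    """Visual signal: left edges line up on the same page."""
    return a.page == b.page and abs(a.left - b.left) <= tolerance


header = Word("Value", "th", True, 1, 310.0, row=0, col=2)
cell = Word("200", "td", False, 1, 310.5, row=3, col=2)
title = Word("SMBT3904", "h1", True, 1, 72.0)

print(same_column(header, cell), visually_aligned(header, cell))
print(same_column(title, cell), visually_aligned(title, cell))
```

Because every modality lives on one record, a rule such as "a bold header cell in the same column and left-aligned with this value" needs no joins across separate per-modality stores.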

Data Variety

Fig. 3: There is a huge variety in the types of richly formatted data.

The third challenge is data variety. Fondue isn’t just bread and cheese; it could be meat and oil, or even chocolate and fruit! Similarly, there is a huge amount of variety in richly formatted documents. This can come from format variety (e.g., different file formats) and stylistic variety (e.g., linguistic variation or differences in table formatting). Fonduer adopts a data model that is generalizable and robust against heterogeneous input data.

Learn More

Read about it in the HazyResearch blog post, or view the full paper.
