CompoundLookup Search compounds by elements
TECHNICAL

How We Indexed Millions of Compounds by Element

A technical look at how we built the database powering CompoundLookup's unique element-based search.

6 min read | January 28, 2026

The Technical Challenge

When we set out to build CompoundLookup, we faced a unique database challenge: index compounds by their constituent elements, not by their formulas.

Traditional chemistry databases index by:

  • Compound name
  • Molecular formula (exact match)
  • CAS number
  • InChI key

We needed to index by: which elements are present.

The Data Pipeline

Step 1: Data Extraction

We download compound data from PubChem's public FTP servers. This includes millions of compounds with their formulas, names, and properties.

Step 2: Formula Parsing

Each formula (like "C₆H₁₂O₆") is parsed to extract unique elements:

  • Input: "C6H12O6"
  • Output: [C, H, O]

This sounds simple but has edge cases:

  • Two-letter symbols (Fe, Na, Cl)
  • Parentheses in formulas
  • Ions and charges
  • Isotope notations

Step 3: Element Indexing

For each compound, we store relationships between the compound and its elements. This creates a many-to-many relationship enabling queries like:

Find all compounds WHERE elements CONTAIN [C, H, O]

Step 4: Subset Generation

To enable partial matching (find C-H compounds even if they also contain O), we pre-generate common element subsets for faster queries.

Query Optimization

When you select elements, we run optimized SQL:

SELECT compounds.* FROM compounds
WHERE EXISTS (element = 'C')
  AND EXISTS (element = 'H')
  AND EXISTS (element = 'O')

Indexes on element columns make this fast even with millions of compounds.

The Result

Our database enables something no other chemistry database can do: instant compound lookup by any element combination. This technology powers every search on CompoundLookup, delivering results in milliseconds.

Try Element-Based Search