How We Indexed Millions of Compounds by Element

The Technical Challenge

When we set out to build CompoundLookup, we faced a unique database challenge: index compounds by their constituent elements, not by their formulas.

Traditional chemistry databases index by:

Compound name
Molecular formula (exact match)
CAS number
InChI key

We needed to index by: which elements are present.

The Data Pipeline

Step 1: Data Extraction

We download compound data from PubChem's public FTP servers. This includes millions of compounds with their formulas, names, and properties.

Step 2: Formula Parsing

Each formula (like "C₆H₁₂O₆") is parsed to extract unique elements:

Input: "C6H12O6"
Output: [C, H, O]

This sounds simple but has edge cases:

Two-letter symbols (Fe, Na, Cl)
Parentheses in formulas
Ions and charges
Isotope notations

Step 3: Element Indexing

For each compound, we store relationships between the compound and its elements. This creates a many-to-many relationship enabling queries like:

Find all compounds WHERE elements CONTAIN [C, H, O]

Step 4: Subset Generation

To enable partial matching (find C-H compounds even if they also contain O), we pre-generate common element subsets for faster queries.

Query Optimization

When you select elements, we run optimized SQL:

SELECT compounds.* FROM compounds
WHERE EXISTS (element = 'C')
  AND EXISTS (element = 'H')
  AND EXISTS (element = 'O')

Indexes on element columns make this fast even with millions of compounds.

The Result

Our database enables something no other chemistry database can do: instant compound lookup by any element combination. This technology powers every search on CompoundLookup, delivering results in milliseconds.

Try Element-Based Search