How We Indexed Millions of Compounds by Element
A technical look at how we built the database powering CompoundLookup's unique element-based search.
The Technical Challenge
When we set out to build CompoundLookup, we faced a unique database challenge: index compounds by their constituent elements, not by their formulas.
Traditional chemistry databases index by:
- Compound name
- Molecular formula (exact match)
- CAS number
- InChI key
We needed to index by: which elements are present.
The Data Pipeline
Step 1: Data Extraction
We download compound data from PubChem's public FTP servers. This includes millions of compounds with their formulas, names, and properties.
Step 2: Formula Parsing
Each formula (like "C₆H₁₂O₆") is parsed to extract unique elements:
- Input: "C6H12O6"
- Output: [C, H, O]
This sounds simple but has edge cases:
- Two-letter symbols (Fe, Na, Cl)
- Parentheses in formulas
- Ions and charges
- Isotope notations
Step 3: Element Indexing
For each compound, we store relationships between the compound and its elements. This creates a many-to-many relationship enabling queries like:
Find all compounds WHERE elements CONTAIN [C, H, O]
Step 4: Subset Generation
To enable partial matching (find C-H compounds even if they also contain O), we pre-generate common element subsets for faster queries.
Query Optimization
When you select elements, we run optimized SQL:
SELECT compounds.* FROM compounds WHERE EXISTS (element = 'C') AND EXISTS (element = 'H') AND EXISTS (element = 'O')
Indexes on element columns make this fast even with millions of compounds.
The Result
Our database enables something no other chemistry database can do: instant compound lookup by any element combination. This technology powers every search on CompoundLookup, delivering results in milliseconds.
Try Element-Based Search