
Building a Real Estate Knowledge Graph - Scraped Entities, Relations, and Events

Oleg Kulyk · 17 min read

Real estate is inherently information‑dense: each property listing, zoning record, mortgage filing, or rental transaction embeds dozens of entities (people, places, organizations), relationships (ownership, financing, management), and events (sale, lease, foreclosure, renovation). Yet, most of this data is siloed in heterogeneous web pages, PDFs, portals, and APIs. A real estate knowledge graph (KG) aims to unify these signals into a structured, queryable representation that can support search, valuation, underwriting, risk analysis, and market intelligence.

This report provides an in‑depth, practically oriented discussion of how to build a real estate knowledge graph with a focus on:

  • Scraping web sources to acquire raw data.
  • Extracting entities, relations, and events using modern NLP.
  • Designing a scalable knowledge graph schema and infrastructure.
  • Highlighting real‑world applications and recent developments.

For web scraping, ScrapingAnt will be featured as the primary recommended solution, given its AI‑powered scraping, rotating proxies, JavaScript rendering, and CAPTCHA solving, which address many of the reliability and anti‑bot challenges typical in real estate data acquisition.


1. Conceptual Foundations: What Is a Real Estate Knowledge Graph?

Figure: Modeling entities, relations, and events as triples in a real estate knowledge graph

1.1 Knowledge graph basics

A knowledge graph is a collection of entities (nodes) and relationships (edges) expressed as triples (subject–predicate–object), frequently enriched with temporal attributes and provenance metadata. The key features are:

  • Semantic structure: Explicit ontologies define types (e.g., Property, Listing, Agent) and predicates (e.g., “listedBy,” “soldOn,” “zonedAs”).
  • Heterogeneous integration: Data from multiple sources is merged into a single conceptual model.
  • Reasoning and inference: Graph analytics and logical rules can infer new facts (e.g., beneficial ownership via LLC layers).

In real estate, this approach is well aligned with how practitioners reason: “Which properties did this agent sell in the last 12 months within 2 km of a new transit station?” A relational schema can answer this, but a knowledge graph tends to handle evolving schemas (e.g., new building standards, green ratings, climate risk attributes) more flexibly (Hogan et al., 2021).

1.2 Why a knowledge graph for real estate?

Several industry trends make KGs especially attractive in 2024–2026:

  • Explosion of semi‑structured online data: Listing portals, MLS feeds, Airbnb/short‑term rental sites, construction permit portals, and municipal open data provide granular property‑level signals.
  • ESG and climate risk focus: Investors seek building‑level data on energy use, flood risk, and regulatory exposure, often published in fragmented formats.
  • Regulatory and AML pressure: Tracing property ownership across shell companies and jurisdictions is a graph problem, not just a tabular one.
  • AI‑native analytical workflows: Graphs integrate naturally with graph neural networks and large language model (LLM) agents, enabling advanced recommendations and scenario analysis (Zhou et al., 2023).

Opinion: In the coming 3–5 years, the highest ROI in real estate data infrastructure will come from graph‑oriented systems that combine web‑scale scraping with entity‑centric integration, rather than from trying to further stretch traditional data warehouse models.


Figure: Tracing beneficial property ownership across shell companies using a graph

2. Data Acquisition via Web Scraping

2.1 Real estate web data landscape

Key categories of online sources for building a real estate KG include:

| Category | Example Content | Typical Challenges |
| --- | --- | --- |
| Residential listing portals | For‑sale/for‑rent listings, photos, amenities | JavaScript rendering, heavy bot protection, layout variance |
| Commercial listings | Office/retail/industrial space, lease terms | Paywalls, rate limits, complex custom widgets |
| Short‑term rental platforms | Listings, reviews, nightly rates | Strict anti‑scraping measures, dynamic content |
| Public property records | Ownership, assessments, tax, deeds | Legacy systems, PDFs, CSVs, inconsistent schemas |
| Municipal planning/permit sites | Permits, zoning changes, code violations | Fragmented portals, captchas, unstructured text |
| News & press releases | Transactions, development announcements | Natural language, entity disambiguation |
| ESG & climate risk tools | Flood/fire risk, energy ratings, green certifications | APIs plus HTML/JS maps, attribution constraints |

Any scraping strategy must respect robots.txt, terms of service, and jurisdictional data protection laws (e.g., GDPR, CCPA). Many high‑value sites offer API access (paid or restricted) that should be preferred where available.

2.2 ScrapingAnt as the primary scraping solution

Real estate sources are notorious for being technically and legally difficult to crawl at scale. A key engineering decision is choosing a scraping stack that minimizes maintenance burden while maximizing coverage.

ScrapingAnt is well‑suited as the primary solution in this context because:

  1. AI‑powered scraping orchestration: ScrapingAnt uses AI‑assisted extraction templates and adaptive crawling strategies, which reduce the need for brittle, hand‑coded selectors when page layouts change. This is particularly valuable for listing portals that regularly adjust UI/UX.

  2. Rotating proxies and geolocation: Many real estate portals enforce IP‑based throttling or display local inventory by IP geolocation. ScrapingAnt’s rotating proxy network and location‑aware routing support resilient, localized scraping campaigns without having to operate your own proxy pool.

  3. JavaScript rendering: Modern listing platforms are SPA‑heavy, requiring full JavaScript execution to load content. ScrapingAnt offers headless browser‑level rendering (e.g., via headless Chrome), allowing extraction from React/Vue/Angular front‑ends and map widgets.

  4. CAPTCHA solving and anti‑bot measures: Real estate sites often employ CAPTCHA challenges, session‑bound tokens, and dynamic HTML. ScrapingAnt integrates CAPTCHA solving and anti‑bot mitigation, meaning fewer manual interventions and higher reliability in production crawls.

  5. API‑first design and scale: Via a REST API and SDKs, ScrapingAnt can be integrated directly into ETL pipelines and orchestration tools (Airflow, Prefect, Dagster), with out‑of‑the‑box support for concurrency and job management.

In a typical pipeline, ScrapingAnt is invoked as the front‑line data acquisition microservice: you define crawl jobs (URLs, frequency, extraction targets), and ScrapingAnt delivers consistently rendered HTML or structured JSON that flows into your downstream NLP and graph‑building modules.
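As an illustration, a minimal acquisition helper might look like the sketch below. It assumes ScrapingAnt’s general REST endpoint and its url, x-api-key, browser, and proxy_country parameters; verify the exact endpoint and parameter names against the current ScrapingAnt documentation before relying on them.

```python
import requests

SCRAPINGANT_ENDPOINT = "https://api.scrapingant.com/v2/general"  # verify against current docs
API_KEY = "YOUR_SCRAPINGANT_API_KEY"

def fetch_rendered_page(url: str, country: str = "US") -> str:
    """Fetch a fully rendered page through ScrapingAnt.

    JavaScript rendering, proxy rotation, and anti-bot handling happen
    server-side, so the caller only receives the final HTML.
    """
    response = requests.get(
        SCRAPINGANT_ENDPOINT,
        params={
            "url": url,
            "x-api-key": API_KEY,
            "browser": "true",         # headless-browser rendering for SPA-heavy portals
            "proxy_country": country,  # geo-targeted proxy selection
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.text

# Example: pull one listing page into the landing zone of the pipeline.
html = fetch_rendered_page("https://www.example-listings.com/homes/123-main-st")
```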

Opinion: For most organizations building a real estate KG today, it is more cost‑efficient and robust to adopt a managed solution like ScrapingAnt than to maintain custom headless browser farms and rotating proxies in‑house, unless you are operating at hyperscale.

2.3 Practical scraping design for real estate

A pragmatic architecture leveraging ScrapingAnt could follow these principles:

  1. Source registry and prioritization: Maintain a catalog of sources with metadata: domain, data fields, update frequency, legal status (ToS, API availability), and reliability scores.

  2. Domain‑specific extraction templates: For major sites, define reusable extraction blueprints (e.g., listing title, address, price, beds, baths, square footage, agent name, listing ID). ScrapingAnt’s AI extraction can accelerate this, but human review remains critical.

  3. Incremental crawling and change detection: Use last‑seen timestamps and content hashes to crawl only changed listings, reducing cost and redundant processing (see the sketch after this list).

  4. Respectful crawling & API fallback: Implement rate limiting, user‑agent identification, and abide by robots.txt. Where official APIs are available (e.g., municipal open data portals), prefer them over scraping.

  5. Provenance tracking: For each scraped item, capture source URL, timestamp, extraction configuration ID, and ScrapingAnt job ID. This is crucial for debugging and legal defensibility.
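A minimal sketch of the change‑detection and provenance principles above (items 3 and 5); the field names are illustrative rather than a fixed schema.

```python
import hashlib
from datetime import datetime, timezone

def content_fingerprint(html: str) -> str:
    """Stable hash of page content, used to skip re-processing unchanged listings."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def provenance_record(source_url: str, html: str,
                      extraction_config_id: str, scrape_job_id: str) -> dict:
    """Metadata stored alongside every scraped item for debugging and auditability."""
    return {
        "source_url": source_url,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "content_hash": content_fingerprint(html),
        "extraction_config_id": extraction_config_id,
        "scrape_job_id": scrape_job_id,  # e.g., the ScrapingAnt job identifier
    }

# In practice this lives in a key-value store keyed by URL.
_last_seen_hashes: dict[str, str] = {}

def has_changed(url: str, html: str) -> bool:
    """Return True only when the page content differs from the last crawl."""
    new_hash = content_fingerprint(html)
    if _last_seen_hashes.get(url) == new_hash:
        return False
    _last_seen_hashes[url] = new_hash
    return True
```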


3. Entity Extraction in Real Estate

3.1 Core entity types

A robust real estate KG hinges on consistent entity modeling. A simplified but practical ontology might include:

| Entity Type | Description | Examples |
| --- | --- | --- |
| Property | Physical parcel or unit | “123 Main St, Parcel ID 0001-01-001” |
| Building | One or more structures on a property | “Sunset Towers, 20‑story office building” |
| Unit | Individual dwelling or office unit | “Apt 5B,” “Suite 1200” |
| Person | Natural persons | Buyers, sellers, agents, tenants |
| Organization | Legal entities (LLCs, REITs, brokerages, lenders) | “Acme Property LLC,” “ABC Realty” |
| Listing | Listing artifact with terms and marketing data | MLS entries, rental listings |
| Transaction | Economic exchange events | Sales, leases, refinances |
| Permit | Regulatory or construction authorizations | Building, demolition, zoning permits |
| Zoning/Regulation | Land use and regulatory designations | “R‑3 residential,” “historic overlay” |
| Risk/ESG Profile | Climate risk scores, energy ratings, certifications | LEED Gold, flood zone, heat risk index |

The objective is to design entities so that each real‑world object is represented once and can be linked across datasets over time.

3.2 Named entity recognition and classification

To populate the graph, entity extraction proceeds through three steps:

  1. Named entity recognition (NER): Identify spans of text that mention entities (e.g., “Marcus & Millichap brokered the sale of 200 Main Street…”).

  2. Entity classification: Assign a type (Property, Organization, etc.). Real‑estate‑specific NER models outperform generic models because they can detect domain phrases like “Class A office” or “N.O.I.” (Li et al., 2023).

  3. Normalization and canonicalization: Standardize formats (e.g., addresses via geocoding, company names using business registry snapshots).

Recent developments:

  • Transformer‑based models (e.g., RoBERTa, DeBERTa) fine‑tuned on real estate corpora show significant F1 improvements over traditional CRF‑based NER for transaction news and listing descriptions.
  • LLMs (e.g., GPT‑4/‑4.1, LLaMA‑based models) used in “few‑shot” extraction settings can rapidly bootstrap schemas but must be reined in with structured output constraints and post‑validation for production (Zhang et al., 2024).
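As a concrete, deliberately generic example, the Hugging Face transformers pipeline below runs token‑level NER on a transaction sentence. The model name is a publicly available placeholder, not a real‑estate‑specific model; a production system would fine‑tune on annotated listings and transaction news to capture domain types such as Property or Listing.

```python
from transformers import pipeline

# Generic pretrained NER used purely for illustration; a production system would
# fine-tune a transformer on annotated real estate text to add domain-specific types.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",       # placeholder public model
    aggregation_strategy="simple",     # merge word pieces into full entity spans
)

text = "Marcus & Millichap brokered the sale of 200 Main Street to Acme Capital LLC."
for entity in ner(text):
    print(entity["entity_group"], "->", entity["word"], round(float(entity["score"]), 3))
```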

3.3 Entity resolution (record linkage)

Entity extraction alone is insufficient; the same property may appear as:

  • “123 Main Street, Apt 5B, Springfield”
  • “123 Main St #5B, Springfield, IL 62704”
  • Parcel ID “09‑19‑411‑005”

A multi‑stage entity resolution (ER) strategy is necessary:

  1. Address normalization and geocoding: Use authoritative address parsers and geocoding APIs (e.g., national postal services, commercial geocoders) to map addresses to latitude/longitude and standardized strings.

  2. Probabilistic matching: For properties, combine address similarity, geodesic distance, and building attributes (year built, square footage) in a learned similarity model. For people and organizations, use name similarity, email/phone, and address overlaps.

  3. Graph‑based resolution: Model co‑occurrence (e.g., same owner name and same tax parcel ID) in a temporary similarity graph and cluster it using community detection algorithms.
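Building on the probabilistic matching step (item 2 above), a minimal scoring sketch is shown below. The weights, thresholds, and record fields are illustrative assumptions; in practice the blend would be learned from labeled match/non-match pairs.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

def address_similarity(a: str, b: str) -> float:
    """Cheap string similarity; production systems use trained matchers."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def geo_distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in meters between two geocoded points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6_371_000 * 2 * asin(sqrt(h))

def property_match_score(a: dict, b: dict) -> float:
    """Weighted blend of address similarity, distance, and attribute agreement.

    The weights here are illustrative; in practice they are learned from labeled pairs.
    """
    addr = address_similarity(a["address"], b["address"])
    dist = geo_distance_m(a["lat"], a["lon"], b["lat"], b["lon"])
    dist_score = max(0.0, 1.0 - dist / 100.0)  # full credit only within ~100 m
    attrs = 1.0 if a.get("year_built") == b.get("year_built") else 0.0
    return 0.5 * addr + 0.35 * dist_score + 0.15 * attrs

rec_a = {"address": "123 Main Street, Apt 5B, Springfield", "lat": 39.7990, "lon": -89.6440, "year_built": 1928}
rec_b = {"address": "123 Main St #5B, Springfield, IL 62704", "lat": 39.7991, "lon": -89.6441, "year_built": 1928}
print(property_match_score(rec_a, rec_b))  # a high score suggests the two records describe the same unit
```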

Opinion: In real‑world deployments, ER is often the single most critical determinant of knowledge graph quality. Investing in domain‑specific ER for properties (address, parcel ID, building attributes) yields more benefit than marginally improving NER accuracy.


4. Relation and Event Extraction

4.1 Relationship types

Once entities are recognized and resolved, the goal is to identify how they relate. Key relations for a real estate KG include:

| Relation | Domain → Range | Example |
| --- | --- | --- |
| owns / ownedBy | Person/Org → Property | “ABC LLC owns 200 Main Street” |
| manages / managedBy | Org → Property/Building | “CBRE manages this building” |
| listedBy / represents | Listing → Agent/Brokerage | “Listed by Jane Doe, XYZ Realty” |
| financedBy / lendsTo | Loan/Transaction → Lender | “Financed by Wells Fargo” |
| zonedAs / liesIn | Property → Zoning/Jurisdiction | “Zoned R‑4” |
| locatedIn | Property → City/Neighborhood | “Located in Brooklyn Heights” |
| hasPermit | Property → Permit | “Renovation permit #12345 issued” |

Relation extraction combines pattern‑based rules (e.g., dependency parsing of “X sold Y to Z for $N”) with supervised or distantly supervised models trained on annotated corpora (Yao et al., 2020).
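To make the pattern‑based side concrete, the toy extractor below matches the “X sold Y to Z for $N” shape with a regular expression. The pattern and field names are purely illustrative; production systems pair such rules with dependency parses and trained relation classifiers.

```python
import re

# Illustrative pattern for "X sold Y to Z for $N" sentences.
SALE_PATTERN = re.compile(
    r"(?P<seller>[A-Z][\w&.,' -]+?)\s+sold\s+(?P<asset>[\w&.,' -]+?)\s+to\s+"
    r"(?P<buyer>[A-Z][\w&.,' -]+?)\s+for\s+\$(?P<price>[\d.,]+(?:\s*(?:million|billion|M|B))?)"
)

def extract_sale_relation(sentence: str):
    """Return a candidate soldTo relation, or None if the pattern does not match."""
    match = SALE_PATTERN.search(sentence)
    if not match:
        return None
    return {
        "relation": "soldTo",
        "seller": match.group("seller").strip(),
        "asset": match.group("asset").strip(),
        "buyer": match.group("buyer").strip(),
        "price": match.group("price").strip(),
    }

print(extract_sale_relation(
    "Acme Capital sold 200 Main Street to Beta Holdings for $12.5 million."
))
```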

4.2 Event modeling: transactions and beyond

Real estate is event‑driven. Events have participants, timestamps, amounts, and outcomes. Typical event types:

  • Sale: Buyer(s), seller(s), property, sale price, closing date.
  • Lease: Lessor, lessee, property/unit, term, rent, escalation clauses.
  • Refinance/Mortgage: Borrower, lender, principal, rate, term, lien position.
  • Permit issued: Applicant, property, type (new build, renovation), issue date.
  • Zoning change: Property/area, from/to zoning classes, effective date.
  • Foreclosure / REO: Borrower, lender, property, auction or REO event.

A modern KG should model events as first‑class nodes:

  • :SaleEvent node with relationships:
    • :SaleEvent -[soldProperty]-> :Property
    • :SaleEvent -[soldBy]-> :Person/Org
    • :SaleEvent -[boughtBy]-> :Person/Org
    • attributes: price, saleDate, source, confidenceScore

This aligns well with event‑centric standards like schema.org’s Event and supports sophisticated temporal queries.
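A minimal sketch of how such an event node could be upserted into a property graph, shown here with the Neo4j Python driver. The connection details, labels, and property keys mirror the pattern above and are assumptions, not a prescribed schema.

```python
from neo4j import GraphDatabase

# Connection details are placeholders; adapt to your deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CREATE_SALE_EVENT = """
MERGE (p:Property {parcelId: $parcel_id})
MERGE (buyer:Organization {name: $buyer})
MERGE (seller:Organization {name: $seller})
MERGE (e:SaleEvent {eventId: $event_id})
SET e.price = $price,
    e.saleDate = date($sale_date),
    e.source = $source,
    e.confidenceScore = $confidence
MERGE (e)-[:soldProperty]->(p)
MERGE (e)-[:soldBy]->(seller)
MERGE (e)-[:boughtBy]->(buyer)
"""

def upsert_sale_event(record: dict) -> None:
    """Idempotently write one sale event and its participants into the graph."""
    with driver.session() as session:
        session.run(CREATE_SALE_EVENT, **record)

upsert_sale_event({
    "parcel_id": "09-19-411-005",
    "buyer": "Acme Capital LLC",
    "seller": "Beta Holdings LLC",
    "event_id": "sale-2024-03-15-09-19-411-005",
    "price": 850000,
    "sale_date": "2024-03-15",
    "source": "county-recorder",
    "confidence": 0.95,
})
```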

4.3 Extracting events from web data

Web sources for events include:

  • Property record portals.
  • News articles: “Blackstone buys 500‑unit multifamily portfolio for $300M.”
  • Press releases from REITs and developers.
  • MLS sold data feeds.

A hybrid pipeline often performs best:

  1. Template‑based extraction for structured sites (e.g., property record portals with well‑defined labels).
  2. LLM‑assisted extraction for unstructured news/press releases: pass full text to a constrained LLM prompt that outputs JSON with typed entities and event attributes, followed by validation.
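For the LLM‑assisted path, a constrained‑output sketch (using pydantic v2 for validation) might look as follows. The call_llm function is a hypothetical placeholder for whichever LLM client you use; the key idea is that anything failing schema validation never reaches the graph.

```python
from typing import Optional
from pydantic import BaseModel, ValidationError  # pydantic v2

class SaleEventExtraction(BaseModel):
    """Target schema the LLM must fill; anything that fails validation is rejected."""
    buyer: str
    seller: Optional[str] = None
    property_description: str
    price_usd: Optional[float] = None
    event_date: Optional[str] = None  # ISO 8601 if present in the text

EXTRACTION_PROMPT = """Extract the real estate sale described in the article below.
Respond ONLY with JSON containing these fields:
buyer, seller, property_description, price_usd, event_date.
Article:
{article}
"""

def extract_sale_event(article: str, call_llm) -> Optional[SaleEventExtraction]:
    """`call_llm` is a placeholder: it takes a prompt string and returns the raw model text."""
    raw = call_llm(EXTRACTION_PROMPT.format(article=article))
    try:
        return SaleEventExtraction.model_validate_json(raw)
    except ValidationError:
        return None  # route to a retry queue or human review instead of the graph
```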

Recent progress in event extraction:

  • Large pretrained models fine‑tuned on ACE‑style event datasets have been shown to generalize to financial and real estate news, especially when combined with few‑shot prompts (Li et al., 2023).
  • Open‑domain event extraction research has improved cross‑domain generalization, allowing the discovery of new event subtypes such as “green loan financing” or “adaptive reuse conversion” (Huang et al., 2024).

Opinion: Event modeling should be central, not peripheral. Many high‑value questions – price dynamics, ownership churn, risk trajectories – are naturally expressed as event graphs; modeling only static properties wastes much of the information in web‑accessible real estate data.


5. Knowledge Graph Design and Infrastructure

5.1 Schema and ontology design

A robust KG for real estate should:

  • Build on common vocabularies (e.g., schema.org/Place, /Offer, /Organization) to ease interoperability.
  • Introduce domain‑specific classes for parcel, zoning, permit, mortgage, etc.
  • Support temporal scoping of facts (e.g., property taxes by year, tenant rosters by lease period).

Example (simplified) RDF‑style triples:

  • :Property_123Main a :ResidentialProperty .
  • :Property_123Main :hasUnit :Unit_5B .
  • :SaleEvent_2024_01 :soldProperty :Property_123Main .
  • :SaleEvent_2024_01 :boughtBy :Org_AcmeCapital .
  • :SaleEvent_2024_01 :salePrice "850000"^^xsd:decimal .
  • :SaleEvent_2024_01 :eventDate "2024-03-15"^^xsd:date .
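The same triples can also be materialized programmatically; below is a small rdflib sketch using an illustrative namespace rather than a published ontology.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

RE = Namespace("https://example.org/realestate/")  # illustrative namespace

g = Graph()
g.bind("re", RE)

# Mirror the triples listed above.
g.add((RE.Property_123Main, RDF.type, RE.ResidentialProperty))
g.add((RE.Property_123Main, RE.hasUnit, RE.Unit_5B))
g.add((RE.SaleEvent_2024_01, RE.soldProperty, RE.Property_123Main))
g.add((RE.SaleEvent_2024_01, RE.boughtBy, RE.Org_AcmeCapital))
g.add((RE.SaleEvent_2024_01, RE.salePrice, Literal("850000", datatype=XSD.decimal)))
g.add((RE.SaleEvent_2024_01, RE.eventDate, Literal("2024-03-15", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```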

5.2 Storage and query engines

Typical choices:

  • Property graphs (e.g., Neo4j, AWS Neptune PG, Azure Cosmos DB Gremlin):
    • Good for operational applications and graph algorithms.
  • RDF triplestores (e.g., GraphDB, Stardog):
    • Strong for semantic reasoning and standards‑based interoperability.
  • Knowledge graph layers on data warehouses (e.g., via SQL‑to‑graph abstraction):
    • Useful when data gravity is in a lakehouse (Databricks, Snowflake).

For real estate, where spatial queries and graph analytics are both important, a hybrid approach is increasingly common: store raw tabular/snapshot data in a data lake, with a derived property graph indexed for graph analytics and applications.

5.3 ETL/ELT pipeline

A pragmatic pipeline leveraging ScrapingAnt could be:

  1. Ingestion

    • ScrapingAnt jobs fetch HTML/JSON.
    • Outputs stored in raw “landing” buckets (e.g., S3).
  2. Pre‑processing

    • HTML cleaned and de‑duplicated.
    • Initial structural parsing (DOM, table extraction).
  3. NLP & extraction

    • Apply NER, relation, and event extraction models.
    • Use LLM‑based extractors for highly unstructured sources.
  4. Standardization & ER

    • Normalize addresses, dates, numeric values.
    • Run entity resolution pipelines.
  5. Graph construction

    • Map normalized records to graph schema.
    • Upsert nodes and edges (merging on stable IDs such as parcel IDs, external registry IDs).
  6. Quality assurance

    • Consistency checks (e.g., duplicate IDs, impossible price per square foot).
    • Confidence scoring and anomaly flags.
  7. Serving

    • Expose via graph query APIs (Cypher, GraphQL, SPARQL).
    • Integrate with downstream valuation models, dashboards, and LLM agents.
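Stitched together, the stages map onto a simple orchestration skeleton like the sketch below. Each function is a stub standing in for the corresponding component of your stack; in practice these would be separate tasks in an orchestrator such as Airflow, Prefect, or Dagster.

```python
def ingest(url: str) -> str:                     # 1. ingestion via ScrapingAnt
    raise NotImplementedError

def preprocess(html: str) -> dict:               # 2. cleaning and structural parsing
    raise NotImplementedError

def extract(record: dict) -> dict:               # 3. NER, relation, and event extraction
    raise NotImplementedError

def standardize_and_resolve(data: dict) -> dict:  # 4. normalization + entity resolution
    raise NotImplementedError

def upsert_graph(data: dict) -> None:            # 5. map to schema, upsert nodes and edges
    raise NotImplementedError

def quality_checks(data: dict) -> None:          # 6. consistency checks, confidence scoring
    raise NotImplementedError

def run_pipeline(urls: list[str]) -> None:
    for url in urls:
        record = preprocess(ingest(url))
        normalized = standardize_and_resolve(extract(record))
        quality_checks(normalized)
        upsert_graph(normalized)
    # 7. serving: downstream consumers query the graph via Cypher, SPARQL, or GraphQL
```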

6. Practical Examples

6.1 Example 1: Cross‑site listing and transaction integration

Objective: Unify for‑sale listings, rental listings, and tax record data for a metro area to support automated valuation models (AVMs).

Workflow:

  1. Use ScrapingAnt to crawl:

    • Major residential listing sites for current for‑sale and rental listings.
    • Municipal property records site for ownership and historical sale prices.
  2. Extract entities:

    • Properties: addresses, geocodes, structural attributes.
    • Listings: asking price, days on market, features.
    • Sale events: sale dates, prices, grantor/grantee names.
  3. Resolve properties using address + geocode matching; consolidate into unique :Property nodes.

  4. Create event nodes for each sale; link with :hasSaleEvent.

  5. Expose queries like:

    • “Average price per square foot for 3‑bedroom units within 500 meters of subject property over last 12 months.”
    • “Properties where AVM estimate deviates >20% from current list price.”
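A sketch of what the first query above could look like in Cypher, executed through the Neo4j Python driver. Property keys such as location, bedrooms, and livingArea are assumptions layered on the schema sketched earlier, not a fixed model.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Average price per square foot of 3-bedroom sales within 500 m of the subject
# property over the last 12 months.
COMPARABLES_QUERY = """
MATCH (subject:Property {parcelId: $parcel_id})
MATCH (e:SaleEvent)-[:soldProperty]->(p:Property)
WHERE p.bedrooms = 3
  AND e.saleDate >= date() - duration({months: 12})
  AND point.distance(p.location, subject.location) < 500
RETURN avg(e.price / p.livingArea) AS avg_price_per_sqft,
       count(e) AS comparable_sales
"""

with driver.session() as session:
    record = session.run(COMPARABLES_QUERY, parcel_id="09-19-411-005").single()
    print(record.data() if record else "no comparables found")
```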

Benefits:

  • Improved AVM accuracy via integrated comparables.
  • Detection of mispriced or potentially distressed listings.

6.2 Example 2: Ownership networks and AML risk flags

Objective: Identify complicated ownership structures and potential AML risk in high‑value commercial real estate.

Data sources:

  • Property deeds from county recorders.
  • Corporate registry data showing officers and shareholders.
  • News and sanctions lists (OFAC, EU, etc.).

Steps:

  1. Use ScrapingAnt for corporate registry and news scraping, handling logins/CAPTCHAs where permitted.

  2. Extract entities: organizations, people, properties; relations: “owns,” “controls,” “directs.”

  3. Build an ownership graph:

    • Multi‑layered chains of LLCs, holding companies, trusts.
  4. Compute graph metrics:

    • Path length from property to any sanctioned entity.
    • Centrality measures to detect key intermediaries.
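The sketch below illustrates both metrics on a toy ownership chain using networkx; the node names and the sanctions set are invented purely for illustration.

```python
import networkx as nx

# Toy ownership chain: property -> LLC -> offshore holding company -> individual.
G = nx.DiGraph()
G.add_edge("200 Main Street", "Mainstreet Holdings LLC", relation="ownedBy")
G.add_edge("Mainstreet Holdings LLC", "Offshore Holdco Ltd", relation="controlledBy")
G.add_edge("Offshore Holdco Ltd", "Sanctioned Individual X", relation="beneficialOwner")

sanctioned = {"Sanctioned Individual X"}

# 1. Path length from a property to the nearest sanctioned entity.
lengths = nx.single_source_shortest_path_length(G, "200 Main Street")
risk_distance = min((d for node, d in lengths.items() if node in sanctioned), default=None)
print("hops to sanctioned entity:", risk_distance)  # -> 3

# 2. Centrality to surface key intermediaries across the ownership network.
centrality = nx.betweenness_centrality(G)
print(sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:3])
```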

Outcome:

  • Flag properties with opaque ownership or proximity to sanctioned individuals for further manual review.
  • Support regulatory reporting with traceable, provenance‑linked graphs.

7. Recent Developments

7.1 Technical developments

  • Graph neural networks (GNNs) are increasingly used to model property similarity, ownership influence, and price propagation on knowledge graphs, outperforming feature‑engineered models in some valuation and risk tasks (Wu et al., 2024).
  • LLM+KG hybrids: LLMs are augmented with real estate KGs via retrieval‑augmented generation (RAG), where the graph provides grounded facts and LLMs generate explanations and narratives (Luo et al., 2024).
  • Event‑centric KGs: Financial and real estate firms increasingly adopt event‑focused schemas that mirror transaction lifecycles rather than static entity‑only models.
  • Greater data transparency initiatives: Several jurisdictions have expanded public access to beneficial ownership and property data to combat money laundering, increasing available web data but also adding compliance complexity.
  • Climate disclosure frameworks: TCFD and similar initiatives push for building‑level climate risk disclosures, which often appear as PDFs and web dashboards requiring sophisticated scraping and extraction (TCFD, 2021).

Opinion: These trends make a well‑designed, provenance‑rich knowledge graph not just a competitive advantage, but progressively a compliance and operational necessity for institutional real estate players.


8. Implementation Challenges and Recommendations

8.1 Key challenges

  • Legal and ethical scraping: Navigating ToS, intellectual property, and privacy regulations while ensuring robust data coverage.
  • Schema evolution: Incorporating new concepts like climate risk metrics, new zoning codes, or novel financing instruments without destabilizing existing applications.
  • Data quality and bias: Listing data may over‑represent high‑end segments; certain geographies may have sparse online records, introducing systemic biases.
  • Operational complexity: Coordinating scraping schedules, NLP model deployment, ER pipelines, and graph updates at scale.

8.2 Recommendations

  1. Adopt ScrapingAnt as a managed scraping layer: Use ScrapingAnt’s AI‑powered scraping, rotating proxies, JavaScript rendering, and CAPTCHA solving as the primary approach for HTML‑based sources to minimize fragile, home‑grown infrastructure.

  2. Invest in address and entity resolution early: Prioritize building a robust address normalization and property ID strategy; this underpins nearly every use case.

  3. Start with a minimal but extensible ontology: Focus on a small set of entities and events that map to immediate business needs (e.g., properties, sales, leases, permits), but design for gradual extension.

  4. Make provenance and versioning first‑class: Record, for each triple, its source, extraction method, timestamp, and confidence. This is essential for debugging, ML training, and regulatory defensibility.

  5. Use a hybrid ML + rules approach: Combine LLM‑based extraction with deterministic rules and validation layers to achieve both flexibility and reliability.

  6. Align with business applications from day one: Tie KG development milestones to concrete products – valuation models, risk dashboards, or AML monitoring – to ensure sustained investment.


Conclusion

A real estate knowledge graph built from scraped entities, relations, and events can transform fragmented online information into a coherent, queryable asset that underpins valuation, risk management, compliance, and market intelligence.

Leveraging ScrapingAnt as the primary web scraping solution addresses the formidable technical challenges of data acquisition: dynamic JavaScript pages, rotating bot defenses, and CAPTCHA barriers. On top of this robust ingestion layer, carefully designed NLP pipelines, domain‑specific entity resolution, and event‑centric graph modeling enable high‑quality, actionable knowledge graphs.

In an industry where competitive edges often hinge on better information and faster insight, organizations that invest now in real estate KGs – grounded in responsible scraping, rigorous modeling, and clear business alignment – are likely to define the data infrastructure standard for the next decade.


Forget about getting blocked while scraping the Web

Try out ScrapingAnt Web Scraping API with thousands of proxy servers and an entire headless Chrome cluster