I’d like to automate my monitoring of local real estate. This is just a private resource; I’m not trying to re-create a site such as http://house.ksou.cn/. I’d like to implement a process that automates the data collection and supports efficient market analysis, something like this:

[Diagram: the intended scraping process]

Scraping

A few years ago it was easy to scrape https://www.realestate.com.au. I tried to revive an old script that had worked well five years ago. It looked a little like this:

require 'mechanize'

# Pretend to be a regular browser.
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

postcodes = %w{1234 2345}

%w{property-land with-4-bedrooms}.each do |type|
  postcodes.each do |postcode|
    a.get("http://www.realestate.com.au/buy/#{type}-in-#{postcode}/list-1?preferredState=sa") do |page|
      next_urls = []  # property detail links, collected by a helper defined elsewhere
      page_urls = []  # pagination links for this search
      get_links(page, next_urls)

      # Pagination links are the ones whose text is just a page number.
      page.links.each do |link|
        page_urls.push(link.href) if link.text =~ /^\d+$/
      end

      # Walk the remaining result pages and collect their property links too.
      page_urls.uniq.each do |url|
        a.get(url) do |pagex|
          get_links(pagex, next_urls)
        end
      end

      ...
    end
  end
end

That now results in error 429 (Too Many Requests). It makes sense: we should not be doing this, and they will try to stop it. I was actually surprised it worked so well a few years ago.
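As an aside, the polite way to respond to a 429 is simply to slow down and back off between attempts. A minimal sketch of that idea in Python using requests; the user agent, delays and retry limit are illustrative and not part of the original script:

import time
import requests

def polite_get(url, max_retries=5):
    """Fetch a URL, backing off whenever the server answers 429."""
    delay = 30  # seconds between retries; illustrative only
    for _ in range(max_retries):
        resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        if resp.status_code != 429:
            return resp
        # A Retry-After header, if present, would give the exact wait;
        # here we just back off exponentially.
        time.sleep(delay)
        delay *= 2
    raise RuntimeError(f'Still rate limited after {max_retries} attempts: {url}')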

I assumed I needed an in-browser solution, so I tried porting it to Node.js and Apify. The code looks something like this:

const Apify = require('apify');

Apify.main(async () => {
    // Start from a clean request queue each run.
    const requestQueueClear = await Apify.openRequestQueue();
    await requestQueueClear.drop();
    const requestQueue = await Apify.openRequestQueue();

    // `options.type` and `options.postcode` come from command-line parsing (not shown).
    const searchUrl = `https://www.realestate.com.au/buy/property-${options.type}-in-sa+${options.postcode}/list-1?source=refinement`;
    console.log(`Search ${searchUrl}`);
    await requestQueue.addRequest({ url: searchUrl });

    // Drive a real (headless) browser via Puppeteer.
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            console.log(`Title of ${request.url}: ${title}`);
        },
    });

    await crawler.run();
});

That also results in error 429.

Man-in-the-Middle Proxy

Looking at the process again, something like this might work instead: browse the site myself and let a proxy capture the pages as they go past.

[Diagram: man-in-the-middle proxy capturing pages while browsing]

The mitmproxy addon that implements it looks like this:

import re

from mitmproxy import ctx

...

# Property detail pages: /property-<type>-<state>-<suburb>-<id>
PROPERTY_URL_RE = re.compile(r'/property-(house|land|unit|villa|townhouse)-(\w+)-([\w\+]+?)-(\d+)')
# Search result pages: /buy/<filters>in-<location>/list-<n>
SEARCH_URL_RE = re.compile(r'/buy/(.*?)in-(.*?)/list')

class RELogger:
    def __init__(self):
        self._property_db = property_sql.PropertySql()
        self.num = 0  # responses seen so far

    def response(self, flow):
        if "realestate.com.au" in flow.request.host:
            self.num += 1
            property_match = PROPERTY_URL_RE.match(flow.request.path)
            if property_match:
                ptype = property_match.group(1)
                pstate = property_match.group(2)
                psuburb = property_match.group(3)
                pid = property_match.group(4)
                if flow.response is not None:
                    if flow.response.content:
                        # Parse the page and store it keyed by the property id.
                        record = property_html.PropertyHTML(flow.response.content)
                        record.suburb = psuburb
                        self._property_db.insert_record(pid, record)
                    else:
                        ctx.log.warn(f"Property response had no content: {flow.request.path}")
                else:
                    ctx.log.warn(f"Property request had no response: {flow.request.path}")
            else:
                search_match = SEARCH_URL_RE.match(flow.request.path)
                if search_match:
                    # TODO
                    pass
                else:
                    ctx.log.warn(f"Could not interpret RE Path {flow.request.path}")

addons = [
    RELogger()
]
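To use the addon, load it into mitmproxy (for example with mitmdump -s relogger.py, where relogger.py is whatever the script above is saved as), install the mitmproxy CA certificate so HTTPS traffic can be decrypted, and point the browser at the proxy, which listens on 127.0.0.1:8080 by default. Browsing the site as I normally would then populates the database as a side effect.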

Let’s see if it is useful.
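If it does prove useful, the analysis side is then mostly queries over the stored records. A hypothetical sketch, assuming the data ends up in a SQLite table named properties with suburb, price and captured_date columns (the real schema lives in property_sql and is not shown above):

import sqlite3

# Hypothetical schema; the real one is defined in property_sql and may differ.
conn = sqlite3.connect('properties.db')
rows = conn.execute(
    '''
    SELECT suburb,
           COUNT(*)   AS listings,
           AVG(price) AS avg_price
    FROM properties
    WHERE captured_date >= date('now', '-90 days')
    GROUP BY suburb
    ORDER BY avg_price DESC
    '''
).fetchall()
for suburb, listings, avg_price in rows:
    print(f'{suburb}: {listings} listings, average ${avg_price:,.0f}')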