Not Scraping Real Estate.com.au
I’d like to automate my monitoring of local real estate. It’s just a private resource; I’m not trying to re-create a site such as http://house.ksou.cn/. The idea is a process that automates the collection of listings and supports some efficient market analysis, roughly: gather the listing pages, extract the property details into a database, and analyse them over time.
Scraping
A few years ago it was easy to scrape https://www.realestate.com.au. I tried to revive an old script that had worked well 5 years ago. It looked a little like this:
require 'mechanize'

a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

postcodes = %w{1234 2345}

%w{property-land with-4-bedrooms}.each do |type|
  postcodes.each do |postcode|
    a.get("http://www.realestate.com.au/buy/#{type}-in-#{postcode}/list-1?preferredState=sa") do |page|
      next_urls = []
      page_urls = []
      # collect the listing links from the first results page
      get_links(page, next_urls)
      # links whose text is just a number are the pagination links
      page.links.each do |link|
        if link.text =~ /^\d+$/
          page_urls.push(link.href)
        end
      end
      # visit the remaining results pages and collect their listing links too
      page_urls.uniq.each do |url|
        a.get(url) do |pagex|
          get_links(pagex, next_urls)
        end
      end
      ...
    end
  end
end
That now results in an HTTP 429 (Too Many Requests) error. It makes sense: we shouldn’t be doing it, and they will try to stop it. I was actually surprised it worked so well a few years ago.
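For reference, a 429 is the server rate limiting the client rather than refusing it outright, and a well-behaved client is expected to back off, honouring any Retry-After header it is given. Purely as an illustration (this is not from the original script; it assumes the requests library and uses a placeholder URL), that looks something like:

# Illustration only: a GET that backs off when the server answers 429,
# honouring the Retry-After header when one is present.
import time

import requests

def polite_get(url, delay=5.0, max_attempts=5):
    for _ in range(max_attempts):
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if resp.status_code != 429:
            return resp
        # Retry-After is either a number of seconds or an HTTP date; only the
        # simple numeric form is handled here, otherwise fall back to a fixed delay.
        retry_after = resp.headers.get("Retry-After", "")
        time.sleep(float(retry_after) if retry_after.isdigit() else delay)
    return resp

polite_get("https://example.com/some-listing")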
I assumed I needed an in-browser solution, so I tried porting it to Node.js and Apify. The code looks something like this:
const Apify = require('apify');

Apify.main(async () => {
  // drop any queue left over from a previous run so the crawl starts fresh
  const requestQueueClear = await Apify.openRequestQueue();
  await requestQueueClear.drop();
  const requestQueue = await Apify.openRequestQueue();

  // options.type and options.postcode come from the script's configuration (not shown)
  const search_url = `https://www.realestate.com.au/buy/property-${options.type}-in-sa+${options.postcode}/list-1?source=refinement`;
  console.log(`Search ${search_url}`);
  await requestQueue.addRequest({ url: search_url });

  const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    handlePageFunction: async ({ request, page }) => {
      const title = await page.title();
      console.log(`Title of ${request.url}: ${title}`);
    },
  });

  await crawler.run();
});
That also results in a 429.
Man-in-the-Middle Proxy
Looking at the process again, something like this might work: instead of crawling the site, browse it as usual and have a man-in-the-middle proxy quietly record the property pages as they pass through, storing them for later analysis. The code to implement it, as a mitmproxy addon, looks like this:
from mitmproxy import ctx

import re

# the author's own helper modules for parsing and storing listings (not shown)
import property_html
import property_sql

...

PROPERTY_URL_RE = re.compile(r'/property-(house|land|unit|villa|townhouse)-(\w+)-([\w\+]+?)-(\d+)')
SEARCH_URL_RE = re.compile(r'/buy/(.*?)in-(.*?)/list')


class RELogger:
    def __init__(self):
        self.num = 0  # count of realestate.com.au responses seen
        self._property_db = property_sql.PropertySql()

    def response(self, flow):
        if "realestate.com.au" in flow.request.host:
            self.num += 1
            # individual listing pages look like /property-<type>-<state>-<suburb>-<id>
            property_match = PROPERTY_URL_RE.match(flow.request.path)
            if property_match:
                ptype = property_match.group(1)
                pstate = property_match.group(2)
                psuburb = property_match.group(3)
                pid = property_match.group(4)
                if flow.response is not None:
                    if flow.response.content:
                        record = property_html.PropertyHTML(flow.response.content)
                        record.suburb = psuburb
                        self._property_db.insert_record(pid, record)
                    else:
                        ctx.log.warn("Could not find property response CONTENT RE Path")
                else:
                    ctx.log.warn(f"Could not find property response RE Path: {flow.response}")
            else:
                # search result pages look like /buy/...-in-.../list-1
                search_match = SEARCH_URL_RE.match(flow.request.path)
                if search_match:
                    # TODO
                    pass
                else:
                    ctx.log.warn(f"Could not interpret RE Path {flow.request.path}")


addons = [
    RELogger()
]
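To try it, mitmproxy loads addon scripts with its -s flag (the script file name below is my assumption, not from the original):

mitmdump -s re_logger.py

With the browser configured to use the proxy (127.0.0.1:8080 by default) and mitmproxy’s CA certificate installed so HTTPS traffic can be decrypted, every listing opened during normal browsing passes through the response hook above and ends up in the database.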
Let’s see if it is useful.
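One piece not shown above is the property_sql helper that insert_record lives in. Purely as a sketch of what the storage side might look like, where the module and class names come from the code above but the schema, the file name, and the assumption that the PropertyHTML record exposes the raw page as .content are all mine:

# Hypothetical sketch of the property_sql helper: keeps the raw listing HTML
# keyed by property id for later analysis. Schema and names are assumptions.
import sqlite3

class PropertySql:
    def __init__(self, path="properties.db"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS properties ("
            " id TEXT PRIMARY KEY,"
            " suburb TEXT,"
            " html BLOB,"
            " fetched_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )

    def insert_record(self, pid, record):
        # keep only the latest copy of each listing
        self._conn.execute(
            "INSERT OR REPLACE INTO properties (id, suburb, html) VALUES (?, ?, ?)",
            (pid, record.suburb, record.content),
        )
        self._conn.commit()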