Technology and the Internet

The Easiest Web Scraper

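If you know a little Python, you can use the easiest web scraper there is, Pandas, to pull the raw data out of a web page. You don't need a complex commercial tool such as 八爪鱼 (Bazhuayu), and working with Pandas is also a good chance to improve your Python skills.

Pandas may be about as simple as a scraping tool gets: reading the tables on a web page takes a single call to read_html(), which returns a list of DataFrame objects that are easy to convert to JSON. Finally, export the data to a CSV file and you are done.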

Original article: https://medium.com/

Reproduced below:

Quick Tip: The easiest way to grab data out of a web page in Python

Adam Geitgey

Let’s say you are searching the web for some raw data you need for a project and you stumble across a webpage like this:

You found exactly what you need — an up-to-date page with exactly the data you need!

But the bad news is that the data lives inside a web page and there’s no API that you can use to grab the raw data. So now you have to waste 30 minutes throwing together a crappy script to scrape the data. It’s not hard, but it’s a waste of time that you could spend on something useful. And somehow 30 minutes always ends up being 2 hours.

For me, this kind of thing happens all the time.

Luckily, there’s a super simple answer. The Pandas library has a built-in function called read_html() that scrapes tabular data from HTML pages.
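The call really is a one-liner. What follows is a minimal sketch rather than the article's original listing: to stay self-contained it parses a small inline HTML string instead of a live page, and the table contents and column names are made up for illustration. It also assumes that an HTML parser Pandas can use (lxml, or BeautifulSoup with html5lib) is installed.

```python
import io

import pandas as pd

# Hypothetical sample page standing in for a real one. With a live page you
# would pass its URL to read_html() instead of a StringIO object.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td>Call Date</td><td>Call Type</td><td>Street</td></tr>
  <tr><td>2018-01-05 14:03:00</td><td>Medical</td><td>Main St</td></tr>
  <tr><td>2018-01-05 14:10:00</td><td>Traffic Accident</td><td>Oak Ave</td></tr>
</table>
</body></html>
"""

# read_html() returns a list with one DataFrame per table it finds.
tables = pd.read_html(io.StringIO(SAMPLE_HTML))

print(len(tables))
print(tables[0])
```

Because nothing marks the first row as a header here, Pandas treats it as ordinary data and numbers the columns 0, 1, 2.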

It’s that simple! Pandas will find any significant HTML tables on the page and return them as a list of DataFrame objects, one per table.

To upgrade our program from toy to real, let’s tell Pandas that row 0 of the table has column headers and ask it to convert text-based dates into time objects:
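A sketch of that step, under the same assumptions as before: an inline HTML string stands in for the real page, the "Call Date" column and its values are hypothetical, and an HTML parser such as lxml is installed. The `header` and `parse_dates` arguments are the only additions to the basic call.

```python
import io

import pandas as pd

# Hypothetical sample page; replace the StringIO object with a URL for real use.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td>Call Date</td><td>Call Type</td><td>Street</td></tr>
  <tr><td>2018-01-05 14:03:00</td><td>Medical</td><td>Main St</td></tr>
  <tr><td>2018-01-05 14:10:00</td><td>Traffic Accident</td><td>Oak Ave</td></tr>
</table>
</body></html>
"""

tables = pd.read_html(
    io.StringIO(SAMPLE_HTML),
    header=0,                   # row 0 holds the column names
    parse_dates=["Call Date"],  # convert this text column to datetimes
)
calls_df = tables[0]

print(calls_df.dtypes)
print(calls_df)
```

The header row no longer shows up as data, and the "Call Date" column holds datetime values rather than strings.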

Which gives you this beautiful output:

And now that the data lives in a DataFrame, the world is yours. Wish the data was available as JSON records? That’s just one more line of code!
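As a rough illustration, the conversion could look like the sketch below. The DataFrame is built by hand with made-up values so the snippet runs without any web page; in practice it would be the one returned by read_html().

```python
import json

import pandas as pd

# Hypothetical stand-in for a DataFrame scraped with read_html().
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(["2018-01-05 14:03:00", "2018-01-05 14:10:00"]),
        "Call Type": ["Medical", "Traffic Accident"],
        "Street": ["Main St", "Oak Ave"],
    }
)

# orient="records" emits one JSON object per row;
# date_format="iso" writes datetimes as ISO 8601 strings.
records_json = calls_df.to_json(orient="records", date_format="iso")

# Pretty-print the result for easier reading.
print(json.dumps(json.loads(records_json), indent=2))
```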

If you run that, you’ll get this beautiful json output (even with proper ISO 8601 date formatting!):

You can even save the data right to a CSV or XLS file:
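A minimal sketch of the CSV export, again with a hand-built, hypothetical DataFrame in place of scraped data. It writes calls.csv into a temporary directory so that running it does not clutter the working directory; that choice is an assumption of this example, not something the article prescribes.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for a DataFrame scraped with read_html().
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(["2018-01-05 14:03:00", "2018-01-05 14:10:00"]),
        "Call Type": ["Medical", "Traffic Accident"],
        "Street": ["Main St", "Oak Ave"],
    }
)

# Write to a temporary directory; use a plain "calls.csv" path to save
# next to your script instead.
csv_path = Path(tempfile.mkdtemp()) / "calls.csv"

# index=False leaves out the DataFrame's row numbers.
calls_df.to_csv(csv_path, index=False)

print(csv_path.read_text())
```

DataFrame.to_excel() works along the same lines for spreadsheet files, but it needs an extra Excel writer package such as openpyxl.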

Run that and double-click on calls.csv to open it up in your spreadsheet app:

And of course Pandas makes it simple to filter, sort or process the data further:
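For instance, filtering, sorting and counting might look like this. The rows and the "Medical" category are invented for the example; only the Pandas calls themselves are standard.

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame scraped with read_html().
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(
            ["2018-01-05 14:03:00", "2018-01-05 14:10:00", "2018-01-05 14:25:00"]
        ),
        "Call Type": ["Medical", "Traffic Accident", "Medical"],
        "Street": ["Main St", "Oak Ave", "Pine Rd"],
    }
)

# Filter: keep only the rows whose call type is "Medical".
medical = calls_df[calls_df["Call Type"] == "Medical"]

# Sort: newest call first.
latest_first = medical.sort_values("Call Date", ascending=False)

# Process: count how many calls there are of each type.
counts = calls_df["Call Type"].value_counts()

print(latest_first)
print(counts)
```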

None of this is rocket science or anything, but I use it so often that I thought it was worth sharing. Have fun!


Thanks for reading! If you are interested in machine learning (or just want to understand what it is), check out my Machine Learning is Fun! series too.

 
