Analyzing Data

شهروز مطهری
پنج‌شنبه, 28 مرداد 1395

In this article, I start an exploration of data science using Python

My first Web-related job was in 1995, developing Web applications for a number of properties at Time Warner. When I first started there, we had a handful of programmers and managers handling all of the tasks. But over time, as happens in all growing companies and organizations, we started to specialize. One of the developers took charge of the logfiles—storing them and then performing some basic analysis on them.

Although I recognized that this work was important, it took years for me to realize that in some ways, his job was more important to the business than the applications that I was writing. The developer who worked on these logfiles, and who analyzed them for our bosses, made it possible to know who was using our system, what they were viewing and using, where they came from, and what correlations we could find among the different data points provided by the logs. Sure, we were providing the content and the applications that brought people to the site, but it was the person analyzing the logfiles who was ensuring that our work was paying for itself and meeting our business goals.

During the past decade, I've come to appreciate the need for such analysis even more, as the Web has exploded in popularity, as businesses have learned to use such data to increase profitability and as data science has become a growing field. We're now drowning in data, and being able to make sense of it using analytical tools and libraries is more important than ever.

In this article, I start an exploration of data science using Python, and how you can take something as ordinary as an Apache logfile and extract information from it to understand your visitors better and what they do. In upcoming articles, I plan to cover how you can use data science methods to analyze this logfile in a number of different ways, gaining insights into the raw data it provides and answering questions about your Web application. I'll describe how this analysis also can be presented to your managers and clients, providing powerful visualizations of the analysis you've performed.

Data Science and Python

I studied something called "learning sciences" in graduate school. While I was there, we often would joke that any discipline that includes the word "science" in its name is probably not a real science. Regardless of whether data science is a "real" science, it is a large, important and growing field—one that allows businesses to make decisions based on the data they have gathered. The more data, and the more intelligently you use that data, the better you'll be able to predict your users' and customers' wants and needs.

Data science has been defined loosely as the intersection of programming and statistics, applied to a particular domain. You gather some data and then use statistical methods to answer questions the data might be able to answer. A background in statistics can be helpful, not only because it'll show you the types of analysis you might want to apply, but also because it gives you a healthy sense of skepticism regarding the correlations you find. Did you really discover that your Web site is popular only with people in a particular area of the world? Or, did you just advertise it heavily in one part of the world, influencing who is more likely to visit?

You can start a data science project by asking a question, or you can start to explore the data in a variety of ways, hoping you will find an interesting correlation. Regardless, data science expects you to know a variety of methods from which you can choose one or more that are appropriate for answering your questions. You then apply the methods, using statistical tests to determine whether your answers are significant—that is, whether they merely could have been random.

Python, long used by system administrators, Web developers and researchers, is an increasingly popular choice among people working in data science. This is the result of several factors coming together. First, Python has a famously shallow learning curve, allowing non-programmers to get started and do things in a short amount of time.

Second, Python works easily and cleanly with a variety of data formats and databases. Thus, whether your raw data is in a text file, relational database, NoSQL database, CSV file, Excel file or something more unusual, the odds are very good that Python will be able to read from it easily and quickly.

Third, a number of libraries for analyzing data in Python, such as NumPy, SciPy and Matplotlib, have been under development for many years, providing a terrific balance of usability, expressive power and high-efficiency execution. In recent years, the Pandas library has added an even more useful layer on top of this.

Finally, the development of IPython, now known as Jupyter, has been nothing short of revolutionary, providing developers and data scientists with the ability to interact with their programs and data (as with a traditional REPL), but to do so on a Web page that easily can be shared among collaborators or sent via e-mail for off-line usage and analysis. Indeed, I now use the IPython Notebook in all of my Python courses. Not only does it provide me with a high-quality way to display the live coding demos I do during my classes, but I then can send the document to my students, who can replay, modify and better understand what I discussed in class.

Importing Data

The first step of any data science project is to get the data ready. In the case of wanting to analyze Apache logfiles, you might think it's enough just to get the file. However, Pandas—the Python library that I'll be using to analyze the data for this example—is like many other data science systems (for example, the R language) that expects the data to be in CSV (comma-separated values) format. This means you'll need to convert the logfile into a CSV file, in which the fields from the Apache log are converted into fields in CSV.

Performing such a transformation is actually quite straightforward in Python. Here is a sample from the Apache logfile from my blog:


122.179.187.119 - - [22/Jan/2016:11:57:26 +0200] "GET
 ↪/wp-content/uploads/2014/10/3D_book.jpg HTTP/1.1" 200 302222
 ↪"http://blog.lerner.co.il/turning-postgresql-rows-arrays-array/"
 ↪"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like
 ↪Gecko) Chrome/47.0.2493.0 Safari/537.36"
122.179.187.119 - - [22/Jan/2016:11:57:27 +0200] "POST
 ↪/wp-admin/admin-ajax.php HTTP/1.1" 200 571
 ↪"http://blog.lerner.co.il/turning-postgresql-rows-arrays-array/"
 ↪"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like
 ↪Gecko) Chrome/47.0.2493.0 Safari/537.36"
54.193.228.6 - - [22/Jan/2016:11:57:29 +0200] "GET
 ↪/category/python/feed/ HTTP/1.1" 200 25856 "-" "Digg Feed
 ↪Fetcher 1.0 (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1)
 ↪AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1
 ↪Safari/534.48.3)"

Each line has the following components:

IP address from which the request was made.
Two fields (represented with - characters) having to do with authentication.
The timestamp.
The HTTP request, starting with the HTTP request method (usually GET orPOST) and a URL.
The result code, in which 200 represents "OK".
The number of bytes transferred.
The referrer, meaning the URL that the user came from.
The way in which the browser identifies itself.

This information might seem a bit primitive and limited, but you can use it to understand a large number of factors better having to do with visitors to your blog. Note that it doesn't include information that JavaScript-based analytics packages (for example, Google Analytics) can provide, such as session, browser information and cookies. Nevertheless, logfiles can provide you with some good basics.

Two of the first steps of any data science project are 1) importing the data and 2) cleaning the data. That's because any data source will have information that's not really useful or relevant for your purposes, which will throw off the statistics or add useless bloat to the data you're trying to import. Thus, here I'm going to try to read the Apache logfile into Python, removing those lines that are irrelevant. Of course, what is deemed to be "irrelevant" is somewhat subjective; I'll get to that in just a bit.

Let's start with a very simple parsing of the Apache logfile. One of the first things Python programmers learn is how to iterate over the lines of a file:


infile = 'short-access-log'
for line in open(infile):
    print(line)

The above will print the file, one line at a time. However, for this example, I'm not interested in printing it; rather, I'm interested in turning it into a CSV file. Moreover, I want to remove the lines that are less interesting or that provide spurious (junk) data.

In order to create a CSV file, I'm going to use the csv module that comes with Python. One advantage of this module is that it can take any separator; despite the name, I prefer to use tabs between my columns, because there's no chance of mixing up tabs with the data I'm passing.

But, how do you get the data from the logfile into the CSV module? A simple-minded way to deal with this would be to break the input string using the str.split method. The good news is that split will work, at least to some degree, but the bad news is that it'll parse things far less elegantly than you might like. And, you'll end up with all sorts of crazy stuff going on.

The bottom line is that if you want to read from an Apache logfile, you'll need to figure out the logfile format and read it, probably using a regular expression. Or, if you're a bit smarter, you can use an existing library that already has implemented the regexp and logic. I searched on PyPI (the Python Package Index) and found clfparser, a package that knows how to parse Apache logfiles in what's known as the "common logfile format" used by a number of HTTP servers for many years. If the variable linecontains one line from my Apache logfile, I can do the following:


from clfparser import CLFParser
infilename = 'short-access-log'
for line in open(infilename):
    print CLFParser.logDict(line)

In this way, I have turned each line of my logfile into a Python dictionary, with each key-value pair in the dictionary referencing a different field from my logfile's row.

Now I can go back to my CSV module and employ the DictWriter class that comes with it. DictWriter, as you probably can guess, allows you to output CSV based on a dictionary. All you need to do is declare the fields you want, allowing you to ignore some or even to set their order in the resulting CSV file. Then you can iterate over your file and create the CSV.

Here's the code I came up with:


import csv
from clfparser import CLFParser

infilename = 'short-access-log'
outfilename = 'access.csv'

with open(outfilename, 'w') as outfile, open(infilename) as infile:
    fieldnames = ['Referer', 'Useragent', 'b', 'h', 'l', 'r', 's',
     ↪'t', 'time', 'timezone', 'u']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames,
     ↪delimiter='\t')
    writer.writeheader()

    for line in infile:
        writer.writerow(CLFParser.logDict(line))

Let's walk through this code, one piece at a time. It's not very complex, but it does pull together a number of packages and functionality that provide a great deal of power in a small space:

First, I import both the csv module and the CLFParser class from theclfparser module. I'm going to be using both of these modules in this program; the first will allow me to output CSV, and the second will let me read from the Apache logs.
I set the names of the input and output files here, both to clean up the following code a bit and to make it easier to reuse this code later.
I then use the with statement, which invokes what's known as a "context manager" in Python. The basic idea here is that I'm creating two file objects, one for reading (the logfile) and one for writing (the CSV file). When the withblock ends, both files will be closed, ensuring that no data has been left behind or is still in a buffer.
Given that I'm going to be using the CSV module's DictWriter, I need to indicate the order in which fields will be output. I do this in a list; this list allows allow me to remove or reorder fields, should I want to do so.
I then create the csv.DictWriter object, telling it that I want to write data to outfile, using the field names I just defined and using tab as a delimiter between fields.
I then write a header to the file; although this isn't crucial, I recommend that you do so for easier debugging later. Besides, all CSV parsers that I know of are able to handle such a thing without any issues.
Finally, I iterate over the rows of the access log, turning each line into a dictionary and then writing that dictionary to the CSV file. Indeed, you could argue that the final line there is the entire point of this program; everything up to that point is just a preface.

Cleaning the Data

You've now seen that you can import the data from another form into a CSV file, which is one of the most common formats used in data science. However, as I mentioned previously, one of the key things that you also need to do is clean the data; analyzing bogus data will give you bogus results.

So, what sort of data here needs to be cleaned?

One obvious candidate is to remove anything that wasn't a real human. Perhaps you're interested in finding out what Web crawlers, such as those from Google and Yahoo, are up to. But it's more likely that you want to know what humans are doing, which means removing all of those robots.

Of course, this raises the question of how you can know whether a request is coming from a robot. As humans, you can examine the User-agent string and make an educated guess. But given that you're trying to remove all of the robots, and that new ones constantly are being added, something automatic would be better.

There's no perfect answer to this, but for the purposes of this article, I decided to use another Python module from PyPI, albeit one that's a bit out of date—one known as robot-detection. The idea is that you import this module and then use the is_robotfunction on the Useragent field. If it's a robot, is_robot will return True. Here's my revised code:


import csv
from clfparser import CLFParser
from collections import Counter
import robot_detection

infilename = 'medium-access-log.txt'
outfilename = 'access.csv'
robot_count = Counter()

with open(outfilename, 'w') as outfile, open(infilename) as infile:
    fieldnames = ['Referer', 'Useragent', 'b', 'h', 'l', 'r', 's',
     ↪'t', 'time', 'timezone', 'u']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames,
     ↪delimiter='\t')
    writer.writeheader()

    for line in infile:
        d = CLFParser.logDict(line)
        if robot_detection.is_robot(d['Useragent']):
            robot_count[d['Useragent']] += 1
        else:
            writer.writerow(d)

The above code is mostly unchanged from the previous version; the two modifications are that I'm now using robot_detection to filter out the robots, and I'm using the Python Counter class to keep track of how many times each robot is making a request. This alone might be useful information to have—perhaps not now, but in the future. For example, from examining the most recent 100,000 requests to my blog, I found that there were more than 1,000 requests from the "domain re-animator bot", something I hadn't even heard of before.

Given that I'm currently concentrating on user data, filtering out these bot requests made my data more reliable and also a great deal shorter. Out of 100,000 records, only 27,000 were from actual humans.

Conclusion

The first step of any data-analysis project is to import and clean the data. Here, I have taken the data and put it into CSV format, filtering out some of the lines that are of less interest. But this is just the start of my analysis, not its end. Next month, I'll explain how you can import this data into Python's Pandas package and start to analyze the logfile in a number of different ways.

Resources

Data science is a hot topic, and many people have been writing good books on the subject. I've most recently been reading and enjoying an early release of the Python Data Science Handbook by Jake VanderPlas, which contains great information on data science as well as its use from within Python. Cathy O'Neil and Rachel Schutt's slightly older book Doing Data Science is also excellent, approaching problems from a different angle. Both are published by O'Reilly, and both are great reads.

منبع خبر: linuxjournal

برچسب‌ها برنامه نویسی ,