Analyzing Data

Analyzing Data

In this article, I start an exploration of data science using Python

My first Web-related job was in 1995, developing Web applications for a number of properties at Time Warner. When I first started there, we had a handful of programmers and managers handling all of the tasks. But over time, as happens in all growing companies and organizations, we started to specialize. One of the developers took charge of the logfiles—storing them and then performing some basic analysis on them.

Although I recognized that this work was important, it took years for me to realize that in some ways, his job was more important to the business than the applications that I was writing. The developer who worked on these logfiles, and who analyzed them for our bosses, made it possible to know who was using our system, what they were viewing and using, where they came from, and what correlations we could find among the different data points provided by the logs. Sure, we were providing the content and the applications that brought people to the site, but it was the person analyzing the logfiles who was ensuring that our work was paying for itself and meeting our business goals.

During the past decade, I've come to appreciate the need for such analysis even more, as the Web has exploded in popularity, as businesses have learned to use such data to increase profitability and as data science has become a growing field. We're now drowning in data, and being able to make sense of it using analytical tools and libraries is more important than ever.

In this article, I start an exploration of data science using Python, and how you can take something as ordinary as an Apache logfile and extract information from it to understand your visitors better and what they do. In upcoming articles, I plan to cover how you can use data science methods to analyze this logfile in a number of different ways, gaining insights into the raw data it provides and answering questions about your Web application. I'll describe how this analysis also can be presented to your managers and clients, providing powerful visualizations of the analysis you've performed.

Data Science and Python

I studied something called "learning sciences" in graduate school. While I was there, we often would joke that any discipline that includes the word "science" in its name is probably not a real science. Regardless of whether data science is a "real" science, it is a large, important and growing field—one that allows businesses to make decisions based on the data they have gathered. The more data, and the more intelligently you use that data, the better you'll be able to predict your users' and customers' wants and needs.

Data science has been defined loosely as the intersection of programming and statistics, applied to a particular domain. You gather some data and then use statistical methods to answer questions the data might be able to answer. A background in statistics can be helpful, not only because it'll show you the types of analysis you might want to apply, but also because it gives you a healthy sense of skepticism regarding the correlations you find. Did you really discover that your Web site is popular only with people in a particular area of the world? Or, did you just advertise it heavily in one part of the world, influencing who is more likely to visit?

You can start a data science project by asking a question, or you can start to explore the data in a variety of ways, hoping you will find an interesting correlation. Regardless, data science expects you to know a variety of methods from which you can choose one or more that are appropriate for answering your questions. You then apply the methods, using statistical tests to determine whether your answers are significant—that is, whether they merely could have been random.

Python, long used by system administrators, Web developers and researchers, is an increasingly popular choice among people working in data science. This is the result of several factors coming together. First, Python has a famously shallow learning curve, allowing non-programmers to get started and do things in a short amount of time.

Second, Python works easily and cleanly with a variety of data formats and databases. Thus, whether your raw data is in a text file, relational database, NoSQL database, CSV file, Excel file or something more unusual, the odds are very good that Python will be able to read from it easily and quickly.

Third, a number of libraries for analyzing data in Python, such as NumPy, SciPy and Matplotlib, have been under development for many years, providing a terrific balance of usability, expressive power and high-efficiency execution. In recent years, the Pandas library has added an even more useful layer on top of this.

Finally, the development of IPython, now known as Jupyter, has been nothing short of revolutionary, providing developers and data scientists with the ability to interact with their programs and data (as with a traditional REPL), but to do so on a Web page that easily can be shared among collaborators or sent via e-mail for off-line usage and analysis. Indeed, I now use the IPython Notebook in all of my Python courses. Not only does it provide me with a high-quality way to display the live coding demos I do during my classes, but I then can send the document to my students, who can replay, modify and better understand what I discussed in class.

Importing Data

The first step of any data science project is to get the data ready. In the case of wanting to analyze Apache logfiles, you might think it's enough just to get the file. However, Pandas—the Python library that I'll be using to analyze the data for this example—is like many other data science systems (for example, the R language) that expects the data to be in CSV (comma-separated values) format. This means you'll need to convert the logfile into a CSV file, in which the fields from the Apache log are converted into fields in CSV.

Performing such a transformation is actually quite straightforward in Python. Here is a sample from the Apache logfile from my blog:


122.179.187.119 - - [22/Jan/2016:11:57:26 +0200] "GET
 ↪/wp-content/uploads/2014/10/3D_book.jpg HTTP/1.1" 200 302222
 ↪"http://blog.lerner.co.il/turning-postgresql-rows-arrays-array/"
 ↪"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like
 ↪Gecko) Chrome/47.0.2493.0 Safari/537.36"
122.179.187.119 - - [22/Jan/2016:11:57:27 +0200] "POST
 ↪/wp-admin/admin-ajax.php HTTP/1.1" 200 571
 ↪"http://blog.lerner.co.il/turning-postgresql-rows-arrays-array/"
 ↪"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like
 ↪Gecko) Chrome/47.0.2493.0 Safari/537.36"
54.193.228.6 - - [22/Jan/2016:11:57:29 +0200] "GET
 ↪/category/python/feed/ HTTP/1.1" 200 25856 "-" "Digg Feed
 ↪Fetcher 1.0 (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1)
 ↪AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1
 ↪Safari/534.48.3)"

 

منبع خبر: linuxjournal


0 نظر درباره‌ی این پست نوشته شده است.

ثبت نظر