Source: subtitles of the section "What is data mining" in the edX course BigData Fundamentals.
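The accompanying Chinese translation drew in part on NetEase Youdao Dictionary results.
The translator notes that some renderings may be inaccurate and welcomes corrections.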
FRANK NEUMANN: In this video, you will learn some of the basics when considering algorithms for big data.
One of the most frequently used approaches to finding out information from big data is through data mining.
But what is data mining?
You most probably have some idea of what data mining is about.
Think about it for a second.
I want to make your notion of data mining precise by providing you with the following definition.
"Data mining is the discovery of 'models' for data."
A model can be various things, and I would like to show you some examples.
An important example is PageRank by Google, which is used for web search.
It models the importance of web pages on the internet.
Pages are connected through links in a web graph.
In this model, each page is assigned a value that summarises the importance of that page.
I will get to this in greater detail later in the course.
But essentially, the number assigned to a specific web page is the probability that a random surfer on the web graph is at this page at any given point in time.
A high value means that the web page is an important one.
In this example, you can see a graph with five nodes and a PageRank value assigned to every page.
It can be observed that the central node, which has many incoming links, gets a higher score than the nodes that do not have as many pages pointing to them.
In this sense, the PageRank of a page is high if it is regarded as important by other pages that link to it.
Having the importance of web pages modelled in this way - one score per page - allows algorithms to judge the importance of pages as part of an internet search engine.
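The random-surfer idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the five-page graph, the page names, the damping factor of 0.85 (the chance that the surfer follows a link rather than jumping to a random page), and the iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate random-surfer probabilities by repeated redistribution.

    links maps each page to the list of pages it links to.
    damping is an assumed parameter, not taken from the lecture.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform guess
    for _ in range(iterations):
        # Every page receives a small share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its current score evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score everywhere.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical five-page web graph: page "C" has the most incoming links.
links = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["C", "D"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this made-up graph the scores sum to one, and the page with the most incoming links ends up with the highest score, which matches the behaviour described for the central node.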
As another example of a data model, I'd like to look at the performance of algorithms.
Let's say you have designed an algorithm and have run it on various benchmarks.
To make things simple, assume that all benchmark instances have the same size - let's say graphs of 10,000 nodes.
The runtime of your algorithm on one particular benchmark instance would be measured in milliseconds.
Running your algorithm on each benchmark instance gives you a set of numbers characterising the runtime behaviour.
If you have done many runs of your algorithm, you might want to have a more compact model of your runtime data.
If you assume that the data comes from a Gaussian distribution, then the model can be given by the average of these numbers together with their standard deviation.
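As a small illustration of this two-parameter model, the sketch below fits a Gaussian to a handful of runtimes using Python's standard `statistics` module. The runtime values are invented for the example and do not come from the lecture.

```python
from statistics import NormalDist

# Hypothetical runtimes in milliseconds, one per benchmark instance.
runtimes_ms = [98.0, 103.5, 101.2, 95.8, 100.4, 104.1, 97.3, 99.7]

# Under the Gaussian assumption, the whole data set is summarised
# by just two numbers: the sample mean and sample standard deviation.
model = NormalDist.from_samples(runtimes_ms)

print(f"mean  = {model.mean:.2f} ms")
print(f"stdev = {model.stdev:.2f} ms")
```

The fitted `model` object replaces the list of measurements: however many runs were recorded, only the mean and the standard deviation need to be kept.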
This chart shows different Gaussian distributions - all have mean 100.
Three distributions with standard deviations 5, 10, and 15 are shown.
You can observe that the distributions are more concentrated around the mean when the standard deviation is small.
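The concentration effect can also be checked numerically. The sketch below computes, for each of the three standard deviations, how much probability mass lies between 90 and 110. That interval is an arbitrary choice made for this illustration; the lecture only shows the chart.

```python
from statistics import NormalDist

MEAN = 100.0
LOW, HIGH = 90.0, 110.0  # illustrative interval around the mean (an assumption)


def mass_near_mean(stdev):
    """Probability that a Gaussian(MEAN, stdev) value lies in [LOW, HIGH]."""
    dist = NormalDist(mu=MEAN, sigma=stdev)
    return dist.cdf(HIGH) - dist.cdf(LOW)


fractions = {stdev: mass_near_mean(stdev) for stdev in (5, 10, 15)}
for stdev, fraction in fractions.items():
    print(f"stdev {stdev:>2}: {fraction:.1%} of the mass lies in [{LOW}, {HIGH}]")
```

The fraction shrinks as the standard deviation grows, which is the same observation the chart makes visually.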
If your data consists of a lot of data points - let's say a billion - in a particular application, then describing your data by just two parameters, average and standard deviation, gives a very compact description.
Having a compact description of your data is useful because you can easily implement algorithms that exploit the model.
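One way such a two-parameter description might be maintained when the data is too large to keep in memory is Welford's online algorithm, which updates the mean and standard deviation one point at a time. The lecture does not name this technique; it is brought in here only as an illustration, and the class name `RunningStats` and the tiny input stream are hypothetical.

```python
import math


class RunningStats:
    """Keeps count, mean, and spread of a stream without storing the points."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        """Fold one new data point into the summary (Welford's update)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stdev(self):
        """Sample standard deviation; 0.0 until at least two points are seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


# Tiny stand-in stream; a real application could feed a billion points.
stats = RunningStats()
for value in range(1, 11):
    stats.add(float(value))

print(f"n = {stats.count}, mean = {stats.mean:.2f}, stdev = {stats.stdev:.2f}")
```

The memory used stays constant regardless of how many points are processed, because only three numbers are stored.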
You will see this throughout the course.
Next, you will look at how to make data mining models more precise and discuss different ways of modelling data in detail.