Oct 02 Wed 2019 09:51
A collection of commonly used data-analysis code

(1) A basic example: fetch a page, parse its table, and save it to CSV

from collections import OrderedDict

import requests
import pandas as pd
from bs4 import BeautifulSoup


class HTMLTableParser:

    def get_html_tables_from_resp(self, html_text):
        soup = BeautifulSoup(html_text, 'html.parser')
        tables = soup.find_all('table')
        return tables

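    def parse_html_table(self, table):
        """Parse one table into a DataFrame.

        Sample of the markup this method expects (two header rows, then data):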
        <tr>
            <th align="center" class="tblHead" nowrap="" rowspan="2">產業類別</th>
            <th align="center" class="tblHead" nowrap="" rowspan="2">公司代號</th>
            <th align="center" class="tblHead" nowrap="" rowspan="2">公司名稱</th>
            <th align="center" class="tblHead" colspan="4" nowrap="">非擔任主管職務之<br/>全時員工資訊</th>
            <th align="center" class="tblHead" colspan="2" nowrap="">同業公司資訊</th>
            <th align="center" class="tblHead" colspan="4" nowrap="">薪資統計情形</th>
        </tr>
        <tr>
            <th align="center" class="tblHead" nowrap="">員工薪資總額(仟元)</th>
            <th align="center" class="tblHead" nowrap="">員工人數-加權平均(人)</th>
            <th align="center" class="tblHead" nowrap="">員工薪資-平均數(仟元/人)</th>
            <th align="center" class="tblHead" nowrap="">每股盈餘(元/股)</th>
            <th align="center" class="tblHead" nowrap="">員工薪資-平均數(仟元/人)</th>
            <th align="center" class="tblHead" nowrap="">平均每股盈餘(元/股)</th>
            <th align="center" class="tblHead" nowrap="">非經理人之<br/>全時員工薪資<br/>平均數未達50萬元</th>
            <th align="center" class="tblHead" nowrap="">公司EPS獲利表現較同業為佳<br/>，惟非經理人之全時員工<br/>薪資平均數低於同業水準</th>
            <th align="center" class="tblHead" nowrap="">公司EPS較前一年度成長<br/>，惟非經理人之全時員工<br/>薪資平均數較前一年度減少</th>
            <th align="center" class="tblHead" nowrap="">公司經營績效與員工薪酬<br/>之關聯性及合理性說明</th>
        </tr>
        <tr>
            <td nowrap="" style="text-align:left !important;">資訊服務業</td>
            <td nowrap="" style="text-align:left !important;">8416</td>
            <td nowrap="" style="text-align:left !important;">實威</td>
            <td nowrap="" style="text-align:right !important;"> 158,636 </td>
            <td nowrap="" style="text-align:right !important;"> 186 </td>
            <td nowrap="" style="text-align:right !important;"> 853 </td>
            <td nowrap="" style="text-align:right !important;"> 9.69 </td>
            <td nowrap="" style="text-align:right !important;"> 807 </td>
            <td nowrap="" style="text-align:right !important;"> 1.20 </td>
            <td nowrap="" style="text-align:right !important;"></td>
            <td nowrap="" style="text-align:right !important;"></td>
            <td nowrap="" style="text-align:right !important;"></td>
            <td nowrap="" style="text-align:left !important;"><br/></td>
        </tr>
        """
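        # Collect every row and every header cell. The header spans two rows;
        # <th> indices 3, 4 and 5 are the colspan group headings, so they are
        # skipped and only the leaf column titles are kept (13 in the sample).
        table_row_tags = table.find_all('tr')
        table_header_tags = table.find_all('th')
        column_names = [
            table_header_tag.get_text()
            for key, table_header_tag in enumerate(table_header_tags)
            if key not in (3, 4, 5)
        ]
        # Columns 7 and 8 are the peer-company figures. Prefix them with
        # '同業公司' (peer companies) so that column 7 does not share a name
        # with the company's own average-salary column.
        column_names[7] = '同業公司{}'.format(column_names[7])
        column_names[8] = '同業公司{}'.format(column_names[8])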

        tr_td_tags = [
            [td_tag.get_text().strip() for td_tag in table_row.find_all('td')]
            for table_row in table_row_tags if table_row.find_all('td')
        ]

        parsed_data = [
            OrderedDict({
                column_names[index]: td_tag
                for index, td_tag in enumerate(tr_td_tag)
            })
            for tr_td_tag in tr_td_tags
        ]

        df = pd.DataFrame(parsed_data)  # a list of dicts becomes one row per dict

        return df


html_parser = HTMLTableParser()

payload = {
    # 'encodeURIComponent': 1,
    'step': 1,
    'firstin': 1,
    'TYPEK': 'sii',  # 'sii' = listed (上市) / 'otc' = over-the-counter (上櫃)
    'RYEAR': 107,    # ROC calendar year; 107 corresponds to 2018
    'code': '',
}

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

resp = requests.post('https://mops.twse.com.tw/mops/web/ajax_t100sb15', data=payload, headers=headers, timeout=2)
resp.raise_for_status()  # stop on an HTTP error instead of parsing an error page

html_tables = html_parser.get_html_tables_from_resp(resp.text)

df_table = html_parser.parse_html_table(html_tables[0])

df_table.to_csv('107_{}.csv'.format(payload['TYPEK']), index=False, encoding='utf-8')

print(df_table)
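
The cells scraped above are still strings such as ' 158,636 ' or '9.69', and empty cells come through as ''. Before doing arithmetic on them they need converting. A minimal sketch, assuming the thousands-separator format seen in the sample row; `to_number` is a hypothetical helper and not part of the original script:

```python
def to_number(cell):
    """Convert a scraped table cell such as ' 158,636 ' to int or float.

    Returns None for empty cells or text that is not numeric.
    (Hypothetical helper for illustration; not part of the original script.)
    """
    text = cell.strip().replace(',', '')
    if not text:
        return None
    try:
        # Whole numbers stay ints; anything with a decimal point becomes float.
        return float(text) if '.' in text else int(text)
    except ValueError:
        return None


# Values taken from the sample row in the docstring above.
sample_row = [' 158,636 ', ' 186 ', ' 853 ', ' 9.69 ', '', '實威']
print([to_number(cell) for cell in sample_row])
```

With pandas, the same function could probably be applied column by column, for example `df_table[col].map(to_number)`, leaving text columns such as the company name alone.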

References:
(1) Using Python to analyze and visualize public salary data of listed and OTC companies

Posted by stanley on PIXNET

Category: Data Science


Feb 22 Fri 2019 10:29
Self-study in data science

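Data science is a discipline that uses data to gain knowledge; its goal is to produce data products by extracting the valuable parts of the data [1]. It combines theory and techniques from many fields, including applied mathematics, statistics, pattern recognition, machine learning, data visualization, data warehousing, and high-performance computing. Data science draws on relevant data to help non-specialists understand problems. Its techniques help us process data correctly and support research in fields such as biology, the social sciences, and anthropology. Data science is also a great help in business competition.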

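Data science learning path

[Image: search result for "data science learning path"]

Data science disciplines

Tools for self-study in data science

[Image: search result for "data science python"]

Reference websites for self-study in data science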

(1) How to Improve Your Domain Knowledge

1. Research – I once worked on a project that involved the development of an MRP application. At the start of the project, I felt overwhelmed and completely out of my comfort zone. I knew I had to get up to speed and fast. I downloaded as many papers and case studies as I could find online and read up on what other businesses in the industry were doing. It took time and sweat but it eventually paid off. I eventually started “talking the talk”.

2. Interview Key Stakeholders – Arrange to meet with subject matter experts. This meeting can be an informal meeting in a café or a workshop setting. Stakeholders are usually open to providing information if they know the benefits that are likely to come their way if they do. The more knowledgeable the stakeholder, the better. Start with open-ended questions before asking specific questions. The more open-ended the questions are, the more you will learn.
To learn, you need to ask questions. The quality of follow-up questions you can ask, however, depends largely on how well you understand the domain – a typical chicken-and-egg situation. Note that there's a limit to how many fundamental questions about the domain you can ask stakeholders. They may start to lose confidence in your abilities if they have to start explaining the “basic” principles of the domain to you.

3. Keep a Knowledge Book – For any project I start in a new domain, I keep what I like to call a “Knowledge Book”. In this Excel workbook, I enter all the information I have gathered about the domain – process information, participants, facts, definitions and acronyms related to the domain. This book becomes my faithful companion and ever-reliable reference guide.

4. Immerse Yourself – Immerse yourself in the domain by talking to users and subscribing to newsletters that provide industry updates. Be in a state of continually seeking the knowledge you need to deliver the expected benefits. For example, if you're in the insurance industry, subscribe to websites like Insurance Times for regular updates.

5. Gain Industry Certification – This is a longer and perhaps tougher route, and you may not have enough time if you are working on a fast-paced project. This route is, however, preferable if you would like to have a deep knowledge of a specific business area. The CPA accreditation, for example, is designed for accountants, CEHA for real estate and CLP for Logistics Professionals, to mention a few. Decide what level of knowledge you would like to attain and find out how much experience you need to proceed with the certification.

6. Find Training Courses – Take advantage of free courses online that teach fundamental concepts in specific domains. Here's an article on this: Improve Your Domain Knowledge With This List Of Free Courses.

(2) Understanding the Business Domain

Here are some thoughts about the questions to ask and answer to get a high-level understanding of how a business works.

Who is the customer?
What is the product?
How is the product sold? (online, phone, in-person sales)
What is the cost structure of the product? (subscription, pay-per-unit, etc)
Where does the money go?
How do we fulfill or distribute the product?
How do we produce or service the product?
How do we support the customer?
How do we market the product?
What partners do we work with to do business? How do we manage these relationships?
What information does the organization manage? Who is responsible for creating, updating, and retiring information? What systems are used to manage the information?

(3) 8 Proven Ways To Increase Your Domain Knowledge In A Project

1. Research What Knowledge To Build
2. Ask Questions
3. Sign Up For A Course
4. Attend Industry Events
5. Build A Knowledge Base
6. Join A Knowledge Community
7. Find A Mentor
8. Offer Your Services


Nov 22 Thu 2018 16:55
Web app with Dash & Plotly for IoT data visualization

https://dash.plot.ly/ (user guide)
https://thingsmatic.com/2016/07/10/a-web-app-for-iot-data-visualization/
https://medium.com/@plotlygraphs/introducing-dash-5ecf7191b503
https://www.tutorialspoint.com/r/r_scatterplots.htm (R software)
http://www.math.nsysu.edu.tw/~lomn/homepage/R/R_plot.htm (R software)
https://alysivji.github.io/reactive-dashboards-with-dash.html (framework, good)

You can check out the code yourself across a few repositories:

Dash backend: https://github.com/plotly/dash
Dash frontend: https://github.com/plotly/dash-renderer
Dash core component library: https://github.com/plotly/dash-core-components
Dash HTML component library: https://github.com/plotly/dash-html-components
Dash component archetype (React-to-Dash toolchain): https://github.com/plotly/dash-components-archetype
Dash docs and user guide: https://github.com/plotly/dash-docs, hosted at https://plot.ly/dash
Plotly.js — the graphing library used by Dash: https://github.com/plotly/plotly.js

#-----------------------------------------

In order to start using Dash, we have to install several packages.

The core dash backend.
Dash front-end
Dash HTML components
Dash core components
Plotly

(1) Install Dash (https://dash.plot.ly)
In your terminal, install several Dash libraries. These libraries are under active development, so install and upgrade frequently. Python 2 and 3 are supported.
sudo pip install dash  # The core dash backend
sudo pip install dash-html-components # HTML components
sudo pip install dash-core-components  # Supercharged components
sudo pip install dash-table  # Interactive DataTable component (new!)
sudo pip install plotly --upgrade
sudo pip install dash_table_experiments
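
Because these packages change often, it can help to confirm which versions actually got installed. A small sketch using the standard library's `importlib.metadata`, which needs Python 3.8 or later (so it assumes Python 3 rather than the Python 2 mentioned above); `installed_versions` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == '__main__':
    # Distribution names taken from the pip commands above.
    wanted = ['dash', 'dash-html-components', 'dash-core-components',
              'dash-table', 'plotly']
    for name, ver in installed_versions(wanted).items():
        print('{}: {}'.format(name, ver or 'not installed'))
```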

(2) Run the app: python HelloDash.py

(1) Hello.py

# -*- coding: utf-8 -*-
import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash()

colors = {
    'background': '#111111',
    'text': '#7FDBFF'
}

app.layout = html.Div(style={'backgroundColor': colors['background']}, children=[
    html.H1(
        children='Hello Dash',
        style={
            'textAlign': 'center',
            'color': colors['text']
        }
    ),
    html.Div(children='Dash: A web application framework for Python.', style={
        'textAlign': 'center',
        'color': colors['text']
    })
])

if __name__ == '__main__':
    app.run_server(debug=True, port=8080)

#------------------

(2) HelloDash.py

# -*- coding: utf-8 -*-
import dash
import dash_core_components as dcc
import dash_html_components as html

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

app.layout = html.Div(children=[
    html.H1(children='Hello Dash'),
    html.Div(children='''Dash: A web application framework for Python.'''),
    dcc.Graph(
        id='example-graph',
        figure={
            'data': [
                {'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'SF'},
                {'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': u'Montréal'},
            ],
            'layout': {
                'title': 'Dash Data Visualization'
            }
        }
    )
])

if __name__ == '__main__':
    app.run_server(debug=True, port=8080)
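
The `figure` passed to `dcc.Graph` in HelloDash.py is a plain dictionary with `data` and `layout` keys, so it can be built from measurements rather than typed by hand. A minimal sketch, assuming readings arrive as a mapping from a series name to a list of values; `bar_figure` and the sensor names are hypothetical:

```python
def bar_figure(series, title):
    """Build a figure dict in the same shape as the HelloDash.py example.

    `series` maps a trace name to its list of y values; x runs from 1 to n.
    (Hypothetical helper for illustration.)
    """
    traces = [
        {
            'x': list(range(1, len(values) + 1)),
            'y': list(values),
            'type': 'bar',
            'name': name,
        }
        for name, values in series.items()
    ]
    return {'data': traces, 'layout': {'title': title}}


# Made-up readings, only to show the resulting structure.
figure = bar_figure({'sensor-a': [4, 1, 2], 'sensor-b': [2, 4, 5]},
                    'Dash Data Visualization')
print(figure['data'][0])
```

The result could then be passed to the layout as `dcc.Graph(id='example-graph', figure=figure)`.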

The last part: how to deploy Dash to Google Cloud

Deploying Dash to Google App Engine
