BBS Crawler for Taiwan

bsdconv + pyte + telnetlib

by Buganini @ PyHUG, Sep. 2012
Obstacles
●   Big5/UAO
●   Segmented Big5
●   Control Sequence
●   Ambiguous Width
Big5/UAO

Big5 variants in use:
●   Gov.tw: BIG5-2003
●   Windows: CP950
●   Libiconv: BIG5(?), CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, BIG5-2003 (experimental)
●   Mozilla: UAO 2.41
●   BBS: UAO 2.50(?)
●   etc.
ref: http://moztw.org/docs/big5/

UAO
   == Unicode At Once
   == Unicode 補完計畫 (Unicode Completion Project)
   != Unicode

UAO
   is extended Big5 (by using PUA),
   including Chinese (trad/sim/hk), Japanese, Cyrillic

   Ex: 喆 (95ED), 轮 (8879), Я (C854), か (C6F1)
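The gap between plain Big5 and UAO can be seen with Python's own codecs. The sketch below is only an illustration under stated assumptions: it uses the standard-library `big5` and `cp950` codecs rather than bsdconv, and the helper names `encodable` and `coverage_table` are made up for this example. It reports which of the sample characters from the slide have a code point in each codec.

```python
def encodable(ch, codec):
    """Return True if the codec has a code point for ch (hypothetical helper)."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Sample characters taken from the slide; the codec list is an assumption
# of this sketch (Python stdlib codecs, not bsdconv's UAO tables).
SAMPLES = ["桑", "喆", "轮", "Я", "か"]
CODECS = ["big5", "cp950"]


def coverage_table():
    """Map each sample character to {codec: encodable?}."""
    return {ch: {codec: encodable(ch, codec) for codec in CODECS}
            for ch in SAMPLES}


if __name__ == "__main__":
    for ch, row in coverage_table().items():
        print(ch, row)
```

Characters reported as not encodable are the ones for which a crawler would need a UAO-aware converter instead of a stock Big5 codec.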
Segmented Big5

A Big5 character sent whole:
   \xAE\xE1

The same character segmented by a control sequence:
   \xAE
   \x1B[1;33m
   \xE1

[Screenshots: PCMAN vs. standard tool]
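To make the segmentation problem concrete, here is a minimal pure-Python sketch of one way to "defragment" such a stream: any control sequence found between the lead and trail byte of a double-byte character is moved to just after the character. This is not bsdconv's `big5-defrag` implementation; the function name `defrag_big5`, the lead-byte range 0x81-0xFE, and the restriction to CSI sequences (`ESC [ ... final`) are assumptions made for this example.

```python
import re

# CSI control sequence: ESC [ parameter bytes, intermediate bytes, final byte.
CSI = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")


def defrag_big5(data):
    """Move CSI sequences that split a double-byte character to after it.

    Hypothetical stand-in for a defragmenting filter; assumes every byte in
    0x81..0xFE starts a two-byte character.
    """
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        m = CSI.match(data, i)
        if m:                       # control sequence between characters
            out += m.group()
            i = m.end()
            continue
        b = data[i]
        if 0x81 <= b <= 0xFE:       # lead byte: look for its trail byte
            j = i + 1
            held = bytearray()
            while True:             # collect sequences wedged in the middle
                m = CSI.match(data, j)
                if not m:
                    break
                held += m.group()
                j = m.end()
            if j < n:
                out += bytes([b, data[j]])   # rejoin lead + trail
                out += held                  # then the delayed sequences
                i = j + 1
            else:                   # truncated input: emit unchanged
                out.append(b)
                out += held
                i = j
            continue
        out.append(b)               # plain single byte
        i += 1
    return bytes(out)


if __name__ == "__main__":
    print(defrag_big5(b"\xAE\xE1\xAE\x1B[1;33m\xE1"))
```

The trade-off in this sketch is that the colour change now applies after the whole character rather than to its second half, which is acceptable when the goal is to extract text.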
●   Control Sequence
●   Ambiguous Width

   08 08 20 20   ← ← SP SP   (backspace, backspace, space, space)
   08 08 0a      ← ← ↓       (backspace, backspace, line feed)
   e2 97 8f      ●           (UTF-8 encoding of U+25CF)
Obstacles: not anymore…

●   Big5/UAO, Segmented Big5: solved in bug5, using bsdconv
●   Ambiguous Width, Control Sequence: solved, using pyte




https://github.com/buganini/bug5

https://github.com/buganini/bsdconv

https://github.com/selectel/pyte
bsdconv (1/4)
import bsdconv

bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

Input:
   \xAE\xE1\xAE\x1B[1;33m\xE1
   AE E1 AE 1B 5B 31 3B 33 33 6D E1

After ansi-control,byte:
   03AE 03E1 03AE 1B5B313B33336D 03E1

After big5-defrag:
   03AE 03E1 03AE 03E1 1B5B313B33336D

After byte,ansi-control:
   AE E1 AE E1 1B 5B 31 3B 33 33 6D

After skip,big5:
   016851 016851 1B5B313B33336D        #U+6851 == 桑

Bsdconv Internal Prefix:
   01: Unicode
   03: Byte
   1B: ANSI Control Sequence
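The prefixed representation shown on this slide can be mimicked in a few lines of plain Python. The sketch below is an illustration only, not bsdconv code: `tag_stream` is a hypothetical name, and it assumes that only CSI sequences count as control sequences. Each ordinary byte is tagged with the `03` prefix, while a control sequence is kept whole and therefore starts with `1B`.

```python
import re

# CSI control sequence: ESC [ parameter bytes, intermediate bytes, final byte.
CSI = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")


def tag_stream(data):
    """Split raw bytes into prefixed tokens, written as upper-case hex.

    Hypothetical helper: "03" + byte for plain bytes, the full sequence
    (which begins with 1B) for ANSI control sequences.
    """
    tokens = []
    i = 0
    while i < len(data):
        m = CSI.match(data, i)
        if m:
            tokens.append(m.group().hex().upper())
            i = m.end()
        else:
            tokens.append("03%02X" % data[i])
            i += 1
    return tokens


if __name__ == "__main__":
    print(" ".join(tag_stream(b"\xAE\xE1\xAE\x1B[1;33m\xE1")))
```

Keeping the control sequence as a single token is what lets a later stage reorder it without touching the bytes around it.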
bsdconv                      (2/4)
 import bsdconv

 bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")


>>> c=bsdconv.Bsdconv("ansi-control,byte:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
03AE
03E1
03AE
1B5B313B33336D ( FREE )
03E1

>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
03AE
03E1
03AE
03E1
1B5B313B33336D ( FREE )

Bsdconv Internal Prefix:
03: Byte
1B: ANSI Control Sequence
bsdconv                      (3/4)
 import bsdconv

 bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")


>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|pass:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
AE
E1
AE
E1
1B5B313B33336D ( FREE SKIP )

>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
016851
016851
1B5B313B33336D ( FREE )

Bsdconv Internal Prefix:
01: Unicode
1B: ANSI Control Sequence

#U+6851 == 桑
bsdconv                      (4/4)
import bsdconv

bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")


>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
>>> s=c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")

>>> s
'\xe6\xa1\x91\xe6\xa1\x91\x1b[1;33m'

>>> s.decode("utf-8")
u'\u6851\u6851\x1b[1;33m'

#U+6851 == 桑
pyte (1/2)
Python Terminal Emulator

import pyte

stream = pyte.Stream()
screen = pyte.Screen(80, 24)
screen.mode.discard(pyte.modes.LNM)
stream.attach(screen)

seq=SEQUENCE_FROM_SERVER
useq=c.conv(seq)
stream.feed(useq.decode("utf-8"))

RESULT_SCREEN="\n".join(screen.display).encode("utf-8")




 With pyte.modes.LNM:
 \r → CR+LF (Carriage Return + Line Feed)
 Without pyte.modes.LNM:
 \r → CR
pyte           (2/2)
                                   #Ambiguous Width
screens.py

width_counter=bsdconv.Bsdconv("utf-8:width:null")
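For readers without bsdconv at hand, the ambiguous-width issue can be illustrated with the standard library. The sketch below is an assumption-laden stand-in, not what the `width` codec does internally: `display_width` is a hypothetical helper built on `unicodedata.east_asian_width`, and the `ambiguous_width` parameter is introduced here to show that characters in the "A" (ambiguous) class, such as ●, take one or two columns depending on the terminal's convention.

```python
import unicodedata


def display_width(text, ambiguous_width=2):
    """Count terminal columns for text (hypothetical helper).

    Wide and fullwidth characters take 2 columns, ambiguous characters take
    ambiguous_width columns, combining marks take 0, everything else takes 1.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            width += 2
        elif eaw == "A":
            width += ambiguous_width
        else:
            width += 1
    return width


if __name__ == "__main__":
    for sample in ["●", "桑", "a"]:
        print(repr(sample),
              unicodedata.east_asian_width(sample),
              display_width(sample, 1),
              display_width(sample, 2))
```

A screen model and the BBS server have to agree on the value used for ambiguous characters; otherwise cursor positions drift after every such character.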
telnetlib           (1/3)




What's wrong with read_until/expect?
  What telnetlib does:
    Server → telnetlib connection → telnetlib.read_until

  What I need:
    Server → telnetlib connection → bsdconv → telnetlib.read_until / Regular Expression

Solutions:
  a) Implement bsdconv → telnetlib.read_until (current)
  b) Hack telnetlib (maybe cleaner)
  c) Other telnetlib implementation?
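Solution (a) amounts to matching against converted text instead of raw bytes. The following is a rough sketch of that idea under several assumptions: `read_until_decoded` and `make_big5_decoder` are hypothetical names, the reader is any callable returning the bytes currently available (for example a telnet connection's non-blocking read), and Python's incremental `big5` decoder stands in for bsdconv so the example runs with the standard library alone.

```python
import codecs
import re
import time


def make_big5_decoder():
    """Incremental decoder that buffers a lead byte until its trail byte arrives."""
    return codecs.getincrementaldecoder("big5")(errors="replace")


def read_until_decoded(read_chunk, pattern, decoder, timeout=5.0, poll=0.05):
    """Read chunks, decode them, and stop when pattern matches the decoded text.

    Returns (match, text); match is None if the timeout expires first.
    """
    regex = re.compile(pattern)
    text = ""
    deadline = time.monotonic() + timeout
    while True:
        chunk = read_chunk()
        if chunk:
            text += decoder.decode(chunk)
            match = regex.search(text)
            if match:
                return match, text
        if time.monotonic() >= deadline:
            return None, text
        if not chunk:
            time.sleep(poll)        # nothing available yet: wait a little


if __name__ == "__main__":
    # Fake server output: one character arrives split across two chunks.
    chunks = [b"\xae", b"\xe1\xae\xe1", b": "]
    match, text = read_until_decoded(
        lambda: chunks.pop(0) if chunks else b"",
        "桑桑:", make_big5_decoder(), timeout=1.0)
    print(match is not None, repr(text))
```

Taking the reader as a callable keeps the sketch independent of any particular telnet library, which also matters because `telnetlib` was removed from the standard library in Python 3.13.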
telnetlib             (2/3)
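                    #Deal with lagging/noop
import time

def term_comm(feed=None, wait=None):
    # conn (telnetlib connection), conv (bsdconv converter), stream and
    # screen (pyte) are set up elsewhere.
    if feed is not None:
        conn.write(feed)
        if wait:
            # Blocking read: wait until the server answers the input.
            s = conn.read_some()
            s = conv.conv_chunk(s)
            stream.feed(s.decode("utf-8"))
    if wait is not False:
        # Non-blocking read: pick up whatever has arrived so far.
        time.sleep(0.1)
        s = conn.read_very_eager()
        s = conv.conv_chunk(s)
        stream.feed(s.decode("utf-8"))
    ret = "\n".join(screen.display).encode("utf-8")
    return ret

     Reading          Feed              No Feed
     wait=None        Non-blocking      Non-blocking
     wait=True        Blocking          Non-blocking (unused)
     wait=False       No                No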
telnetlib            (3/3)
                  #Deal with lagging/noop
Action with or without screen refresh
   term_comm('Action A', False)
   term_comm('Action B', True)
   #Action A+B cause screen refresh

Action with screen refresh (important content)
   term_comm('Action', True)

Action with screen refresh
   term_comm('Action')

Wait+Retry



- Demo -
- End -
